MECHANISMS AND CHEMOPREVENTION OF OVARIAN CARCINOGENESIS

PROGRESS  REPORT 


INTRODUCTION 

Due  to  its  asymptomatic  development  and  frequent  diagnosis  at  advanced  stages,  ovarian 
cancer  is  the  most  deadly  among  the  gynecological  cancers.  A  better  understanding  of  the  early 
molecular  events  leading  to  the  disease  is  of  utmost  importance  for  the  development  of  strategies 
for  its  efficient  early  diagnosis  and  prevention,  which  could  improve  patient  survival  and  quality 
of life. We have shown that DMBA-induced mutagenesis in the rat ovary, combined with
gonadotropin hormone-mediated enhancement of mitogenesis in the ovarian surface epithelium,
gives rise to lesions ranging from preneoplastic to early neoplastic and advanced ovarian tumors,
which resemble the human disease. The goal of the study is to use this animal model to investigate
the molecular mechanisms behind ovarian oncogenesis and to conduct a preclinical trial for its
chemoprevention. The aims of the study are: 1) Determine the molecular genetic mechanisms
behind  ovarian  oncogenesis  in  the  DMBA/gonadotropin-animal  model;  2)  Determine  the 
efficacy  of  a  COX-2  inhibitor  to  prevent  the  appearance  and/or  progression  of  DMBA-induced 
ovarian  lesions;  and  3)  Study  the  in  vivo  mechanisms  of  the  putative  chemopreventive  effect  of 
COX-2 inhibition. Genomic and mutation analyses, as well as other molecular biology assays,
will be employed to accomplish the objectives of the study.


BODY 

During the first year of support by this DoD-CDMRP grant, we have made the
following progress toward the proposed aims of the study:

1. We have established a histopathological classification similar to that of the human disease and
initiated the molecular characterization of the ovarian lesions induced in the rat ovary following local
administration of low-dose DMBA with or without stimulation with gonadotropin hormones. We
have  demonstrated  that  hormone  co-treatment  leads  to  an  increased  lesion  severity,  which 
indicates  that  gonadotropins  may  indeed  promote  ovarian  cancer  progression.  We  have  also 
shown  that  point  mutations  in  the  Tp53  and  Ki-Ras  genes,  which  are  characteristic  of  human 
ovarian  carcinomas,  are  also  present  in  the  DMBA-induced  ovarian  lesions  in  the  rat.  Most 
importantly,  the  presence  of  such  mutations  in  putative  preneoplastic  lesions  confirms  their 
precursor,  clonal  character.  Additionally,  we  observed  an  overexpression  of  estrogen  and 
progesterone  receptors  in  “preneoplastic”  and  early  neoplastic  lesions  and  their  loss  in  advanced 
tumors,  suggesting  a  role  of  these  receptors  in  ovarian  cancer  development.  Our  data  indicate 
that this DMBA animal model gives rise to ovarian lesions that closely resemble human ovarian
cancer and that it is adequate for further studies on the mechanisms of the disease and its clinical
management. The data from this work were included in a report that was recently published in
Cancer Research1.

2. The goal of specific aim 1 of the study during the first year of support is to generate a
large number of DMBA-induced ovarian lesions in the rat, sufficient to ensure the statistical power and
significance of the findings, and to use microarray gene expression analysis to initiate their molecular
classification into groups at different stages of neoplastic development. Immediately prior to the
beginning of the first year of support by this grant (Nov.-Dec. 2003), using funds provided by the
FCCC  NIH-OC-SPORE,  we  initiated  an  experiment  in  which  160  female  Sprague-Dawley  rats  at 
6  weeks  of  age  were  subjected  to  bilateral  survival  surgery  to  the  ovaries.  Animals  were 
separated into 2 groups: a) Control groups a1 (20 animals; no hormones) and a2 (40 animals;
with hormones): beeswax-impregnated surgical sutures were implanted in the portion of each
ovary that is contralateral to the fallopian tube; b) DMBA/hormone group: 100 animals;
DMBA/beeswax-impregnated surgical sutures were implanted bilaterally in the ovaries of the
animals as above. Two months following the surgical procedure, animals (a2 and DMBA) were
subjected to 4 cycles of sequential administration of PMSG and hCG. These procedures are
described in detail in the Experimental Design and Methods section of our grant proposal and in
our recent publication (Stewart et al.1). All treated animals were maintained for up to 12 months
from  the  survival  surgical  procedure,  or  until  disease  development  and  animal  distress  became 
apparent.  Animals  were  sacrificed  following  guidelines  approved  by  the  FCCC  IACUC 
committee,  the  NIH  and  the  DoD-CDMRP. 

The ovaries of all animals were harvested and fixed in 70% ethanol at 4°C for 18 hr,
following which they were embedded in paraffin. Three 5 µm-thick sections, ~50 µm apart from
each other, were obtained from the two end-portions of each ovary, stained with H&E and
subjected  to  histopathology  examination  to  determine  the  presence  of  ovarian  lesions.  Based  on 
their  histopathological  characteristics  and  stage  of  neoplastic  development,  the  detected  lesions 
were classified into 7 categories. Control ovarian surface epithelial (OSE) cells obtained from
a1, a2 and DMBA/hormone ovaries generated 3 additional sample categories. So far, at least 3
samples per category have been processed further for laser-capture microdissection (LCM), RNA
purification and amplification. Depending on the size of lesion and its epithelial cell component,
4-6 sections of 5 µm thickness were generated from the organ portion adjacent to the corresponding
H&E-stained sections. Similarly, 4-6 sections of 5 µm thickness were also generated from control a1
and  a2  ovaries.  Tissue  sections  were  deparaffinized,  rehydrated,  stained  with  HistoGene  LCM 
Frozen  Staining  Kit  (Arcturus),  dehydrated  and  stored  at  -80°C  until  they  were  used  for  LCM. 
Sections  were  subjected  to  LCM  to  select  epithelial  cell  component  (2,000-5,000  cells)  from 
corresponding  lesions  or  normal  OSE  on  CapSure  HS  LCM  Caps  using  an  AutoPix  Automated 
LCM apparatus (Arcturus). Where necessary, individual cells were selected using a laser-beam
diameter of 7-10 µm. Total RNA was immediately isolated from the microdissected cells using
the PicoPure RNA Isolation Kit (Arcturus), yielding ~5 ng of total RNA. RNA quantification
and integrity assessment were carried out by microfluidic electrophoresis on a 2100 Bioanalyzer
using  the  RNA  6000  Pico  Chip  LabChip  Kit  (Agilent  Technologies).  Total  RNA  obtained  from 
all  microdissected  samples  was  subsequently  subjected  to  amplification  using  an  Ovation 
Aminoallyl  RNA  Amplification  and  Labeling  System  (NuGen  Technologies).  The  product  of 
this  amplification  is  anti-sense  aminoallyl-substituted  cDNA,  which  can  be  used  for  both 
oligonucleotide and cDNA microarray gene expression analysis, as well as for real-time
qRT-PCR-based verification of the microarray results.

We  had  originally  proposed  to  use  rat  oligonucleotide  microarrays  to  be  produced  at  the 
FCCC DNA microarray facility. In the past year, our facility has been experiencing problems
with inconsistency in the quality and results of the mouse oligo-arrays generated with the MWG
mouse oligonucleotide library. For this reason, the Human Genetics Research Program at
FCCC has recently purchased the Affymetrix GeneChip system, a package offer that also
includes GeneChip microarrays at a considerably reduced price. This system includes the GeneChip
Fluidics Station 450 for automated microarray washes, the GeneChip Scanner 3000 for
microarray  image  capture,  and  the  GeneChip  Operating  Software  (GCOS)  V.1.2  for  microarray 
image  and  data  analysis.  Instead  of  investing  funds  and  effort  in  generating  rat  oligo-arrays  of 
inconsistent  quality  at  the  FCCC  microarray  facility,  we  decided  to  use  the  Affymetrix  U34A  Rat 
Genome  arrays.  These  contain  7,000  full-length  sequences  and  1,000  EST  clusters  from  the 
UniGene database. Given the high quality of the GeneChip arrays and the fact that our
Ovation-based amplification of RNA results in aminoallyl-substituted anti-sense cDNA, which is well
adapted for oligo-microarray hybridization and gene expression analysis, we are confident that
we will be able to rapidly obtain reliable and reproducible genomic data from our rat ovarian
lesion samples. This will be further facilitated by our extensive expertise in data normalization
and mining. We have recently developed an algorithm for efficient normalization of
microarray-generated datasets from multiple experiments. This algorithm was tested with both
radioactively labeled filter-macroarrays and Affymetrix GeneChip array data, and we have
demonstrated that it is superior to other conventional methods using mean, median or linear
regression analysis2.

3.  The  goal  of  specific  aim  2  of  the  study  during  the  first  year  of  support  is  to  initiate  a 
chemoprevention  trial  on  the  basis  of  the  DMBA/hormone  animal  model  of  ovarian  cancer 
developed  and  characterized  by  us.  Given  the  space  limitations  of  the  Laboratory  Animal 
Facility  (LAF)  at  FCCC  to  house  animals  subjected  to  treatment  with  carcinogens,  our  plan  was 
to  purchase  the  animals  for  this  study  as  soon  as  all  animals  treated  for  the  purpose  of  specific 
aim  1  were  sacrificed.  This  was  planned  for  November-December,  2004.  Animals  are  normally 
subjected  to  survival  surgery/carcinogenesis  1-2  weeks  after  their  transfer  to  our  LAF  and  no 
later than 6-7 weeks of age. The goal of the proposed preclinical chemoprevention trial, which was
approved by the scientific review committee, is to test the efficacy of the COX-2 specific
inhibitor Celecoxib to prevent the appearance and/or progression of DMBA-induced ovarian
lesions. Recently, the results of large clinical trials with this and other COX-2 specific inhibitors
have demonstrated serious toxicities and side effects on the basis of which all clinical trials have
been put on hold. For this reason, we decided to postpone the proposed preclinical
testing of Celecoxib in order to avoid the possibility of obtaining results that may no longer be
relevant  for  the  clinic.  We  have  contacted  the  DoD-CDMRP  Grants  Manager,  Dr.  Naba  Bora, 
and  are  currently  discussing  alternative  agents  to  be  used  for  the  proposed  preclinical 
chemoprevention  study.  A  description  of  the  changes  in  the  design  of  the  preclinical 
chemoprevention  study  will  be  submitted  to  the  DoD-CDMRP  for  review  and  approval  prior  to 
its  initiation. 


KEY  RESEARCH  ACCOMPLISHMENTS 

The  following  are  the  key  research  accomplishments  during  the  first  year  of  support  by  this 
DoD-CDMRP  grant: 

• Local DMBA administration to the ovary induces ovarian cancer development with
distinct preneoplastic and neoplastic stages1.

• Gonadotropin hormones contribute to ovarian cancer progression1.

• Tp53 and Ki-Ras mutations that are characteristic of human ovarian carcinomas are
present in DMBA-induced preneoplastic ovarian lesions1.

• DMBA/gonadotropin treatments were used to generate multiple ovarian lesions at
different stages of neoplastic development in 100 Sprague-Dawley rats. 60 additional
animals were treated as controls (20 with vehicles alone, and 40 with DMBA-vehicle
and gonadotropins).

• Ovarian epithelial cells were harvested from ovarian lesions at different stages of
neoplastic development (10 categories including normal OSE) using LCM
microdissection.

• Total RNA was purified from at least 3 samples per ovarian lesion category and subjected
to linear amplification to generate aminoallyl-substituted anti-sense cDNAs. The
latter will be used as probes to interrogate 8,000 unique rat genes on Affymetrix
U34A GeneChip microarrays.


REPORTABLE  OUTCOMES 

• 1 - Stewart, S.L., Querec, T.D., Ochman, A.R., Gruver, B.N., Bao, R., Babb, J.S., Wong,
T.S., Koutroukides, T., Pinnola, A.D., Klein-Szanto, A., Hamilton, T.C., and
Patriotis, C. Characterization of a carcinogenesis rat model of ovarian preneoplasia
and neoplasia. Cancer Res., 64: 8177-83, 2004.

• 2 - Stoyanova, R., Querec, T.D., Brown, T.R. and Patriotis, C. Normalization of DNA
arrays by Principal Component Analysis. Bioinformatics, 20: 1772-84, 2004.

• Sprague-Dawley rat ovaries (200) treated with DMBA and gonadotropins and containing
ovarian epithelial lesions at different stages of neoplastic development.

• Total RNA purified from the epithelial component of above ovarian lesions and
aminoallyl-substituted anti-sense cDNAs obtained through linear amplification of above
RNAs. The latter can be used directly for microarray (oligo and cDNA) and real-time
qRT-PCR gene expression analysis.


CONCLUSIONS 

We  have  demonstrated  that  direct  application  of  a  low  dose  of  DMBA  in  the  rat  ovary,  alone 
or  combined  with  multiple  cycles  of  gonadotropin  administration,  elicits  a  neoplastic  process  that 
affects  mostly  the  OSE  and  leads  to  the  progressive  development  of  putative  epithelial  cell 
preneoplasia,  serous  low  malignant  potential  (LMP)  tumors,  and  invasive  carcinomas.  The 
similarity  in  histology  and  path  of  dissemination  of  the  DMBA-induced  rat  ovarian  carcinomas 
with  those  in  the  human,  as  well  as  the  presence  of  gene  mutations  that  are  common  in  human 
ovarian  cancer,  demonstrate  the  validity  of  this  animal  model  for  further  delineation  of  the 
mechanisms  underlying  ovarian  tumorigenesis.  Finally,  DMBA-induced  ovarian  oncogenesis  in 
the  rat  could  be  used  to  pre-clinically  test  new  agents  for  the  prevention  and/or  therapy  of  the 
disease.


ABSTRACT

Motivation: Detailed comparison and analysis of the output
of DNA gene expression arrays from multiple samples
require global normalization of the measured individual gene
intensities from the different hybridizations. This is needed
for accounting for variations in array preparation and sample
hybridization conditions.

Results: Here, we present a simple, robust and accurate procedure
for the global normalization of datasets generated with
single-channel DNA arrays based on principal component analysis.
The procedure makes minimal assumptions about the
data and performs well in cases where other standard procedures
produced biased estimates. It is also insensitive to data
transformation, filtering (thresholding) and pre-screening.

Contact: Christos.Patriotis@fccc.edu

INTRODUCTION 

The development of high-density DNA arrays (oligonucleotide
and cDNA) has revolutionized our ability to characterize
biological processes and samples genetically by
monitoring the relative expression of thousands of genes simultaneously
(Bowtell, 1999; Debouck and Goodfellow, 1999;
Duggan et al., 1999; Lander, 1999). To meet the challenges
for interpretation of this complex data, sophisticated software
packages have become available for analysis of the gene
expression profiles, such as ScanAnalyze (Eisen and Brown,
1999), ArrayExplorer (Patriotis et al., 2001) and ImaGene
(Biodiscovery, Inc.). An important, but still unresolved, issue
is associated with the normalization of the relative expression
of genes across a series of microarray experiments. In order to
compare the results from multiple samples, which is the ultimate
goal of these studies, it is obligatory that the individual
array datasets be normalized to correct for the inherent experimental
differences. The critical element in this process is the
discrimination of the interesting, biological variation from
the obscuring variation, which is related to the experimental
conditions (Hartemink et al., 2001). This is why the initial
attempts towards normalization of array datasets relied on the
concept that a group of genes could be identified a priori and
serve as ‘housekeeping’ genes, assuming that their expression
will reflect directly the obscuring experimental variation.
As discussed in detail below, if such a subset of genes could
be identified reliably, then well-defined normalization factors
could be estimated to within the accuracy inherent in the measurements.
Unfortunately, as shown by others (Butte et al.,
2001; Selvey et al., 2001) and by us in this report, this simple
concept works only in very limited cases. (Here and in the
rest of the paper, we will refer to the a priori specified
housekeeping genes as ‘designated’ in order to distinguish
them from those determined to be the ‘true’ housekeeping
genes. The latter represent the subset of genes whose expression
is invariant to the particular biological and/or experimental
variables in the multiple microarray experiments being
compared.)

The realization that in most of the cases the ‘designated’
housekeeping genes cannot be used for reliable normalization
has spurred the development of alternative approaches for
normalization. The majority of these approaches determine
normalization factors on the basis of averages over the behavior
of the entire set of genes measured (Schuchhardt et al.,
2000). Typically, these methods utilize the mean or median of
the array intensities (Quackenbush, 2001) and linear (Golub
et al., 1999) or orthogonal regression (Sapir and Churchill,
2000). A variety of non-linear techniques were also proposed
(Schadt et al., 2000, 2001; Li and Wong, 2001; Bolstad et al.,
2003).

There is also a series of methods that identify a subset of
genes in the data that can be assumed as housekeeping (Zien
et al., 2001; Kepler et al., 2002). All these approaches perform
satisfactorily when the following two assumptions about the
data are met:

(1)  the  majority  of  the  genes  (in  the  fitting  segment  for  the 
non-linear  approaches,  or  overall)  are  not  affected  by 
the  experimental  variables,  i.e.  they  can  all  be  regarded 
as  housekeeping  genes;  and 

(2) the subset of differentially expressed genes is ‘activated’
symmetrically, i.e. the overall intensity change of
up- and down-regulated genes is similar.
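The effect of violating assumption (2) can be seen in a toy two-array example. The sketch below uses made-up intensities (not data from this study) in which three of ten genes are up-regulated and none are down-regulated; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical intensities for 10 genes on two arrays (illustrative values only).
array_1 = np.full(10, 100.0)
# Array 2 was measured twice as brightly, so the true factor back to array 1 is 0.5.
array_2 = 2.0 * array_1
# Three genes are up-regulated 4-fold and none are down-regulated (asymmetric activation).
array_2[:3] *= 4.0

# Mean-based factor: pulled away from 0.5 by the up-regulated genes (100 / 380).
mean_factor = array_1.mean() / array_2.mean()
# Median-based factor: still 0.5 here, because most genes are unchanged (assumption 1 holds).
median_factor = np.median(array_1) / np.median(array_2)
```

In this constructed case the mean-based factor is biased, while the median survives only because the unaffected genes remain the majority.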

Here we present a novel normalization approach that performs
satisfactorily even when the conditions above are not
met, which is the most commonly observed scenario. In contrast
to the methods requiring the selection of a baseline array,
this method analyses the entire dataset simultaneously, and, as
such, it is considered a complete data method (Bolstad et al.,
2003). The goal of the technique is to determine in a multi-array
experiment if there is a subset of genes whose expression
may be considered unaffected by the ‘interesting’ (biological)
sources of variation and if there are such, to identify this set of
specific, ‘data-driven’ housekeeping genes and use them for
normalization. Briefly, if the results from each array measurement
are represented in a multi-dimensional vector space
where each axis is a different sample, then the entire experiment
can be represented as a series of points corresponding to
the strength of each gene’s expression in each sample measured.
If a set of genes with an unchanged relative expression
is present, their intensity levels will represent points along a
straight line through the origin. We present a principal component
analysis (PCA)-based method for identifying such a
line, if one exists. The factors determined from the expression
of these genes can be used to normalize the gene expression
in the individual array datasets.

MATERIALS  AND  METHODS 

Theory 

Consider a gene expression dataset consisting of m arrays
with n genes each. Let D be the data matrix containing in
its rows the measured expression levels, and let $g_{ij}$ be the
measured expression level of the i-th gene in the j-th array
($i = 1, \ldots, n$; $j = 1, \ldots, m$). We seek to identify a subset, S,
of s genes (s < n) whose expression remains constant over
the experimental conditions of the study. Mathematically, for
the genes in S the following equations hold:

$$q_j g_{ij} = c_i \quad \text{or} \quad g_{ij} = c_i / q_j,$$

where $q_j$ is the j-th normalization constant and $c_i$ is the true
concentration of the i-th gene, which is constant across the
samples. If we plot the points $g_{ij}$ in an m-dimensional space,
we can see that they lie along a line through the origin, which
has projections along the axes of $\{1/q_j\}$. If we can find such a
line, we will have identified our desired relative normalization
constants (relative since unless at least one of the $c_i$'s is known,
it is impossible to normalize the data absolutely).
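The geometry of this model can be checked numerically. The following is a minimal sketch, assuming noise-free data and made-up values for the concentrations and normalization constants (`c`, `q` and the array sizes are illustrative, not taken from the study). It shows that every gene vector points along $\{1/q_j\}$ and that only ratios of the $q_j$ can be recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 200
c = rng.uniform(10.0, 1000.0, size=n_genes)   # hypothetical true concentrations c_i
q = np.array([1.0, 0.5, 2.0, 1.25])           # hypothetical normalization constants q_j

# g_ij = c_i / q_j: one row per gene, one column per array.
G = np.outer(c, 1.0 / q)

# Unit direction of every gene vector; all rows coincide with the line direction.
directions = G / np.linalg.norm(G, axis=1, keepdims=True)
line = (1.0 / q) / np.linalg.norm(1.0 / q)

# Only relative constants are identifiable: here the first array is the reference.
q_relative = G[0, 0] / G[0, :]
```

Choosing the first array as the reference is an arbitrary convention for this sketch; any other array, or any fixed scaling, would serve equally well.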

We now turn to the problem of identifying the genes in S.
The obvious method is to calculate the densities in the cloud
of n data points in the m-dimensional data space, which represent
the directions of n gene levels in the m observations. In
reality, this is difficult because there are approximately $N^{m-1}$
directions to examine if each orientation is divided into N
segments. In order to reduce the dimensions of the space that
needs to be examined, we use PCA to identify the directions
along which the principal variations of the genetic expressions
lie in the original m-dimensional space. We project the data
points onto the first two of these directions and examine their
angular distribution to determine if a line through the origin
is present. Note that the original line in the full space need not
lie in this plane as its projection into the plane will also be a
line through the origin.

PCA is used commonly for reducing the dimensionality of
complex data (Anderson, 1971) and has been used previously
in the analysis of microarray data from time-course experiments
(Alter et al., 2000, 2003), for normalization of gene
expression ratios obtained from two different microchips of
two-channel arrays (Nielsen et al., 2002) and for partitioning
large-sample microarray-based gene expression profiles
(Peterson, 2003). It is also an integral part of the exploration
of large genomic datasets (Misra et al., 2002). Previously,
we have applied the PCA technique for removing ‘unwanted’
variation in multi-spectral datasets (Stoyanova and Brown,
2002).

Briefly, PCA identifies the directions of the largest variations
in the data via the principal components (PCs), and
represents the data in a coordinate system defined by the
PCs ($P_1, P_2, \ldots$), as follows:

$$D = R_1 P_1 + R_2 P_2 + R_3 P_3 + \cdots + R_m P_m, \qquad (1)$$

where $P_j$ ($1 \times m$) and $R_j$ ($n \times 1$) are row and column matrices;
$R_j$ contain the projections of the data along the PCs ($j = 1, \ldots, m$),
generally called scores. Below, some of the relevant
properties of the PCs are listed.

(1) $P_j$ are eigenvectors of the data-covariance matrix (calculated
around the origin, rather than around the mean)
and are orthonormal, i.e. $P_j P_k^{T} = \delta_{jk}$.

(2) The PCs are ordered by the decreasing amount of variation
in the data they explain. Let $\lambda_1, \lambda_2, \ldots, \lambda_m$
be the eigenvalues of the covariance matrix ($\lambda_1 >
\lambda_2 > \cdots > \lambda_m$). Each PC explains a portion of the
total variance of D, proportional to its corresponding
eigenvalue.

(3) The magnitude of $R_j$ is proportional to its corresponding
eigenvalue, $\lambda_j$.

(4) D can be represented sufficiently with fewer than m
PCs [Equation (1)]. PCA provides a representation of
the data in a lower-dimensional space of significant
variables.

(5) The PCs are a linear combination of the original data.
The coefficients of this linear combination (the elements of $P_j$) are
typically referred to as loadings and represent the projections
of the PCs along the axes of the original
m-dimensional space.

(6) The PCs minimize the squared distances between the variables
(gene-expression levels) and themselves.
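Equation (1) and the first few properties can be reproduced with a singular value decomposition of the uncentered data matrix. This is a minimal sketch rather than the authors' implementation: the function name `pca_about_origin`, the random test matrix and the choice to scale the eigenvalues by 1/n are all assumptions made for illustration.

```python
import numpy as np

def pca_about_origin(D):
    """PCA without mean-centering.

    Returns the scores R (n x m, column j is R_j), the PCs P (m x m, row j is P_j)
    and the eigenvalues of D^T D / n in decreasing order.
    """
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    R = U * s                        # projections of the genes onto each PC
    P = Vt                           # orthonormal rows
    eigenvalues = s**2 / D.shape[0]  # with this convention, ||R_j||^2 = n * eigenvalue_j
    return R, P, eigenvalues

rng = np.random.default_rng(1)
D = rng.uniform(1.0, 100.0, size=(50, 4))   # hypothetical 50 genes x 4 arrays
R, P, lam = pca_about_origin(D)

# Equation (1): D is the sum of the rank-one terms R_j P_j.
reconstruction = sum(np.outer(R[:, j], P[j]) for j in range(P.shape[0]))
```

Because the data are not centered, the first PC of non-negative intensity data points roughly along the bulk of the gene vectors, which is what the normalization relies on.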

From the last three properties, it follows that the loadings of the
first PC may serve as normalization coefficients of the arrays.
In many cases, when the assumptions (1) and (2) (see Introduction)
are met, as discussed in detail below, PCA can provide
directly the normalization coefficients sought. In other cases,
we can use the first two PCs to detect linear behavior in a subset
of genes S (s < n) that are the ‘true’ housekeeping genes.
PCA applied only to the genes in S will identify the appropriate
normalization line in the entire m-dimensional data space.
Its projections can then be used as normalization factors.
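For the simple case in which every gene behaves as a housekeeping gene, the idea can be sketched as follows. Under the model $g_{ij} = c_i / q_j$ the loadings of the first PC are proportional to $1/q_j$, so the sketch takes their reciprocals and scales them so that the first array is the reference. The function name, the simulated data and the 1% noise level are assumptions for illustration only.

```python
import numpy as np

def first_pc_normalization(D):
    """Relative normalization constants q_j from the first PC (computed about the origin).

    Assumes all genes in D behave as housekeeping genes; q is scaled so that q[0] == 1.
    """
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    p1 = Vt[0]
    if p1.sum() < 0:      # the sign of a singular vector is arbitrary
        p1 = -p1
    q = 1.0 / p1          # reciprocal loadings
    return q / q[0]

rng = np.random.default_rng(2)
c = rng.uniform(50.0, 1000.0, size=500)         # hypothetical true levels c_i
q_true = np.array([1.0, 0.5, 2.0, 1.25])        # hypothetical constants q_j
noise = rng.normal(1.0, 0.01, size=(500, 4))    # 1% multiplicative measurement noise
D = np.outer(c, 1.0 / q_true) * noise

q_est = first_pc_normalization(D)
D_normalized = D * q_est   # q_j * g_ij, comparable across arrays
```

After multiplication by `q_est`, the columns of `D_normalized` differ only by the simulated measurement noise.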

The procedure [dubbed PCA(line)] tests automatically
for the existence of and detects the group of genes, which
are distributed ‘tightly’ along a line in the plane defined by
the first two PCs. We chose this plane because by definition
it contains the largest variations in the expression levels.
Although the actual straight line of the desired normalization
may not lie completely in this plane, its projection in
the plane is also a straight line and will serve to identify the
desired set of genes. To identify such a line, we divide the part
of the plane that contains all the points into small angular segments
and determine the number of data points (genes) in each
segment. The segment(s) containing the data-driven housekeeping
genes will contain a disproportionally large density
of points. This procedure is described below and given in
detail in Appendix 1.

Initially, we assume S is an empty set ($S = \emptyset$). In the plane
defined by $P_1$ and $P_2$, we partition the angle through the origin
defined by the genes with maximal and minimal components
on $P_2$ in p equal angular segments. Let $s_k$ ($k = 1, \ldots, p$)
be the subset of genes in D that belong to the k-th segment
($s_1 \cup s_2 \cup \cdots \cup s_p = D$). We recommend that p be set initially
to contain on average at least 10 genes per segment. Let $\theta_k$
be the angular densities defined as the number of genes in
each segment, $s_k$, and $M(\theta_k)$ and $V(\theta_k)$ be, respectively, the
sample mean and variance of $\theta_k$. Then, the density of the k-th
segment is considered to be significant if

$$\theta_k > M(\theta_k) + \mu \sqrt{V(\theta_k)}, \qquad (2)$$

where $\mu$ is a parameter indicating the number of standard
deviations above the mean that is required for significance. If
a normal distribution of $\theta_k$ is assumed, then $\mu = 1.96$ will
correspond to a one-sided test with a type-I error of 2.5%.
However, in most cases, due to different procedures for
microarray image quantification as well as the specific pre-filtering
of the data, the distribution of $\theta_k$ is unknown. In
cases where a normal distribution of $\theta_k$ cannot be assumed, it
is recommended that their histogram be examined and $\mu$ be
set appropriately. For added stringency of the test, the genes
in segment $s_k$ are assumed to be housekeeping genes only if
$\theta_{k+1}$ of the neighbouring segment $s_{k+1}$ is also tested significant.
Then the genes in the two segments are merged in S, i.e.
$S = s_k \cup s_{k+1}$. If the angular density of the genes of further
contiguous segments is detected to be significant, then these
genes are added to S. After all segments are tested, PCA is
applied to S and the reciprocal values of the loadings of the
resultant first PC are used as normalization coefficients.
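The steps above can be sketched in a few lines of NumPy. This is an illustrative reading of the description, not the procedure of Appendix 1: the function names, the use of the smallest and largest gene angles as the limits of the partition, the treatment of several separate dense runs (all are kept) and the simulated data are assumptions.

```python
import numpy as np

def dense_segment_mask(angles, p, mu=1.96):
    """Flag genes lying in significantly dense, contiguous angular segments."""
    lo, hi = angles.min(), angles.max()
    if hi == lo:                                  # all genes already on one line
        return np.ones(angles.shape, dtype=bool)
    # Segment index of every gene; the largest angle is assigned to the last segment.
    idx = np.minimum(((angles - lo) / (hi - lo) * p).astype(int), p - 1)
    density = np.bincount(idx, minlength=p)       # number of genes per segment
    threshold = density.mean() + mu * density.std(ddof=1)   # right-hand side of Equation (2)
    significant = density > threshold
    # Added stringency: a segment counts only if a neighbouring segment is significant too.
    left = np.r_[False, significant[:-1]]
    right = np.r_[significant[1:], False]
    keep = significant & (left | right)
    return keep[idx]

def pca_line(D, p=None, mu=1.96):
    """Return (q, in_S): relative constants with q[0] == 1, and the gene mask for S."""
    if p is None:
        p = max(2, D.shape[0] // 10)              # about 10 genes per segment on average
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    sign = 1.0 if Vt[0].sum() >= 0 else -1.0      # orient the first PC towards positive intensities
    R1, R2 = sign * U[:, 0] * s[0], U[:, 1] * s[1]
    in_S = dense_segment_mask(np.arctan2(R2, R1), p, mu)
    if not in_S.any():
        return None, in_S                         # no line detected; the caller chooses a fallback
    p1 = np.linalg.svd(D[in_S], full_matrices=False)[2][0]
    p1 = p1 if p1.sum() >= 0 else -p1
    q = 1.0 / p1                                  # reciprocal loadings of the first PC of S
    return q / q[0], in_S

# Hypothetical data: 300 housekeeping genes (noise only) and 300 differentially expressed genes.
rng = np.random.default_rng(3)
c = rng.uniform(50.0, 1000.0, size=(600, 1))
u = 1.0 / np.array([1.0, 0.5, 2.0, 1.25])
fold = np.ones((600, 4))
fold[:300] = rng.normal(1.0, 0.05, size=(300, 4))
fold[300:] = rng.lognormal(0.0, 0.7, size=(300, 4))
q_est, in_S = pca_line(c * u * fold)
```

Returning `None` when no dense run is found leaves the choice between the two fallback cases to the caller, since the density test alone cannot distinguish them.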

If the procedure fails to identify at least two significant
contiguous segments, then either all the genes in the data can
be assumed to be housekeeping (S = D), or, in the extreme
situation, the housekeeping genes are either too few to be
detected or non-existent ($S = \emptyset$). In the first case, the loadings
of the first PC from the initial PCA of D are the true normalization
coefficients and can be used for direct normalization.
There is not very much to be done in the second case: the
PCA-derived normalization would be as erroneous as the ones
produced by any other linear technique. Let $\Lambda_1$ be the fraction
(in per cent) of the total variance in the data accounted for by
the first eigenvalue, $\lambda_1$. In this case, a low $\Lambda_1$ (in our
experience <60%) will be indicative of a lack of normalizing genes.
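The diagnostic based on the first eigenvalue is straightforward to compute. The sketch below is a minimal illustration with two constructed extremes; the function name and the toy matrices are assumptions, and the 60% figure quoted above is the authors' empirical guide rather than something this code derives.

```python
import numpy as np

def first_eigenvalue_percent(D):
    """Per cent of the total variation about the origin explained by the first PC."""
    s = np.linalg.svd(D, compute_uv=False)
    eigenvalues = s**2    # proportional to the eigenvalues of the covariance about the origin
    return 100.0 * eigenvalues[0] / eigenvalues.sum()

# Two constructed extremes (illustrative only).
line_like = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 2.0])  # every gene on one line
no_line = np.eye(3)                                           # no shared direction at all

percent_line = first_eigenvalue_percent(line_like)
percent_none = first_eigenvalue_percent(no_line)
```

For the rank-one matrix the first PC explains essentially all of the variation, whereas for the identity matrix it explains only one third.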

Biological  samples  (datasets) 

Human ovarian surface epithelial cell lines

Microarray datasets were obtained from experiments in which RNA of
human ovarian surface epithelial (HOSE) cells was analyzed using
Atlas 1.2 Human arrays (Clontech). The details of array preparation
and data extraction are described elsewhere (Patriotis et al., 2001).
Briefly, the HOSE cells were derived from a short-term primary cell
culture obtained from one of the ovaries of an individual predisposed
to ovarian cancer. The short-term HOSE cell culture was transduced
with a Cytomegalovirus-based vector expressing the Simian Virus-40
large T-antigen. As a result, the in vitro lifespan of the cells,
while still ‘mortal’ (118M), was considerably extended, leading to
the spontaneous outgrowth of an ‘immortal’/non-transformed cell line
(118Im). Following multiple passages in culture, the 118Im cell line
spontaneously gave rise to cells that acquired anchorage-independent
growth characteristics and, ultimately, the potential to grow tumours
in vivo when inoculated in nude mice (118NuTu) (Frolov, A. et al.,
unpublished data). In the first experiment, the cDNA probes were
derived from total RNA purified from 118M, 118Im and 118NuTu. In the
second experiment, microarray data were obtained from 118NuTu cells
treated for different lengths of time (0, 24, 48 and 72 h) with the
synthetic retinoic acid derivative Fenretinide (4-HPR) (Moon et al.,
1979).

Lymphoma data (LD)

The dataset was constructed from the supplementary datasets
of Golub et al. (1999). The microarray measurements were
performed with RNA of samples obtained from bone marrow
and peripheral blood from patients with acute lymphoblastic
leukemia (ALL) or acute myeloid leukemia (AML) at the time
of diagnosis, using high-density oligonucleotide Affymetrix
arrays. In the paper referred to, the data were normalized
by pair-wise linear regression (LR) between the first
sample (baseline) and the rest of the samples in the dataset.
Only genes with satisfactory quality (marked with ‘P’
in the datasets provided) in each pair were considered for the
regression. The normalized datasets, as well as the normalization
factors, are supplied at
http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi.
The data used here were unprocessed and ‘non-normalized’, and
the combined datasets resulted in a data matrix containing
72 arrays and 7129 genes.
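
A pair-wise LR factor of this kind can be sketched as below. This is an illustration only: it assumes a least-squares line through the origin, which is our assumption rather than a documented detail of the Golub et al. procedure, and the function name, the ‘A’ flag and the intensities are hypothetical.

```python
def lr_factor(baseline, sample, base_flags, sample_flags):
    """Slope b of the no-intercept least-squares fit sample ~ b * baseline.

    Only genes flagged 'P' in both arrays of the pair enter the fit.
    """
    pairs = [(x, y)
             for x, y, fx, fy in zip(baseline, sample, base_flags, sample_flags)
             if fx == "P" and fy == "P"]
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return sxy / sxx

# Hypothetical intensities; the third gene is excluded by its 'A' flag.
baseline = [200.0, 500.0, 9000.0, 4000.0]
sample = [240.0, 600.0, 100.0, 4800.0]
flags_b = ["P", "P", "P", "P"]
flags_s = ["P", "P", "A", "P"]
print(round(lr_factor(baseline, sample, flags_b, flags_s), 3))   # 1.2
```

Because the slope is a ratio of sums of products, a few very strongly expressed genes can dominate it.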

Simulated  data 

The values in the simulated datasets were chosen to be realistically
probable, based on our experience with data obtained
with the Atlas 1.2 Clontech arrays (Patriotis et al., 2001).
The number of genes was set to 500, in agreement with
our observation that between 30 and 50% of the genes are
expressed in any of the samples investigated in our lab. In
the first array, the expression levels, $g_{i1}$ [in arbitrary units
(a.u.)], were simulated using the relation $g_{i1} = 2^{u}$, where $u$
is uniformly distributed between 1 and 16.

In all simulated datasets of pairs of arrays, a multiplication
factor of 1.2 was applied to the second array, equivalent to
$q_1 = 1$ and $q_2 = 1.2$. Gene intensities were assumed to be
background-corrected, and (unless noted otherwise) signals
with intensities less than 200 were zeroed (thresholded).
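
A noise-free version of such a pair can be generated as follows. This is a sketch under stated assumptions: the function name, the seed and the use of Python's `random` module are ours, and no up- or down-regulation is simulated.

```python
import random

def simulate_pair(n_genes=500, factor=1.2, threshold=200.0, seed=0):
    """Simulate two arrays: g = 2**u with u ~ U(1, 16), second = factor * first.

    Intensities below the threshold are set to zero in each array.
    """
    rng = random.Random(seed)
    first = [2.0 ** rng.uniform(1.0, 16.0) for _ in range(n_genes)]
    second = [factor * g for g in first]

    def zero_below(values):
        return [v if v >= threshold else 0.0 for v in values]

    return zero_below(first), zero_below(second)

first, second = simulate_pair()
expressed = sum(1 for a, b in zip(first, second) if a > 0.0 and b > 0.0)
print(expressed, "of", len(first), "genes are non-zero in both arrays")
```

Because $u$ is uniform on the exponent, roughly half of the simulated genes fall below 200 a.u. and are zeroed by the threshold.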

‘Noise’  data 

The sources of noise in microarray datasets are multiple and
complex, and they contribute simultaneously, in variable
amounts, to the total variance in the data. Generally, the total
noise contribution to the measured signal represents a variable
mixture of two components: one is independent of gene intensity
and affects the expression of all genes equally, and the other
is gene-dependent and increases with the magnitude of the gene
expression. To investigate the contribution of noise to the
process of normalization, we simulated two pairs of replicate
arrays, as described above. Random noise was added to each array.
In the first set, the noise was gene-independent ($N_1$),
uniformly distributed between -2500 and 2500, and in the second
set it was gene-dependent ($N_2$), uniformly distributed with a
magnitude of ±10% of the gene intensities. Formally, with $g$
denoting the gene intensity,

$$N_1 = -2500 + 5000\,u, \qquad N_2 = \frac{g}{10}\,(2u - 1), \qquad u = U(0, 1). \qquad (3)$$
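
The two noise terms of Equation (3) translate directly into code. The sketch below is illustrative; the function names and the example intensity are hypothetical.

```python
import random

def gene_independent_noise(rng):
    """N1 = -2500 + 5000 u, with u ~ U(0, 1)."""
    return -2500.0 + 5000.0 * rng.random()

def gene_dependent_noise(g, rng):
    """N2 = (g / 10) (2u - 1): uniform within +/-10% of the intensity g."""
    return (g / 10.0) * (2.0 * rng.random() - 1.0)

rng = random.Random(1)
g = 8000.0                      # hypothetical gene intensity (a.u.)
n1 = gene_independent_noise(rng)
n2 = gene_dependent_noise(g, rng)
print(-2500.0 <= n1 < 2500.0, abs(n2) <= 0.1 * g)   # True True
```

For a weakly expressed gene the gene-independent term can exceed the signal itself, whereas the gene-dependent term never exceeds 10% of it.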


‘Signal’  dataset  1 

‘Signal’ dataset 1 (SD1) contained two pairs of simulated
arrays. The first pair satisfied conditions (1) and (2) (see
Introduction) by choosing a substantial number of the genes to be
housekeeping (250) and the number and magnitude of change
of up- and down-regulated genes to be equal. The second pair
was constructed to illustrate a scenario where these assumptions
are not met: the housekeeping genes (150) were not
a majority, and more genes were ‘up-regulated’ (200) than
‘down-regulated’ (150) (the details about the simulated up-
and down-regulation are given in Appendix 2). Two independent
sets of random noise were added to each array, each generated
as half the sum of the gene-independent and gene-dependent
noise [Equation (3)], i.e. $\frac{1}{2}(N_1 + N_2)$.

‘Signal’  dataset  2 

‘Signal’ dataset 2 (SD2) contained eight arrays with 500 genes
each. The first array in SD2 was generated randomly, as
described above. The gene expression levels of the remaining
seven arrays were generated with the idea of recreating
a scenario where progressive changes occur in the studied
samples (e.g. time-response to treatment or undergoing a process
of immortalization and malignant transformation). The
details of simulation parameters for up- and down-regulation
are given in Appendix 3. The arrays were multiplied by
coefficients generated at random between 0.3 and 3. Finally,
random noise, generated as described for SD1, was added to
each array.

RESULTS 

Housekeeping  genes  in  HOSE  cells 

Figure 1(a) depicts the correlation plot of the ‘designated’
housekeeping genes in the first experiment with HOSE cells:
118M on the x-axis, and on the y-axis 118Im (black series) and
118NuTu (gray series). The expression of these genes is well
correlated ($R^2 = 0.96$), and, in this case, they can be used for
normalization of the data. Figure 1(b) depicts the correlation
plot of the expression of the same set of housekeeping genes
in the 118NuTu cells, untreated (0 h, x-axis) and treated with
4-HPR for 24, 48 and 72 h (y-axis; black circles, gray triangles
and shaded squares, respectively). In this case, the correlation
between the expression of the ‘designated’ housekeeping
genes is quite poor ($R^2 = 0.43$, 0.81 and 0.85, respectively).
From these data, it is clear that the expression profiles of the
‘designated’ housekeeping genes change non-uniformly
in the cells in response to the drug treatment.

‘Noise’  data 

Figure 2(a) and (b) (left panels) depict the correlation between
the data in the two pairs of simulated arrays in this dataset
together with the linear trendline through the origin. Note that
the regression coefficient in both cases is very close to the true
value of the multiplication factor, 1.2. The fit is slightly tighter
for the second dataset ($R^2 = 0.986$ versus $R^2 = 0.992$),
which reflects the smaller contribution of the noise to the
overall gene intensities. Figure 2(c) (left panel) depicts the
correlation between two replicate array datasets obtained from
118M. The genes depicted by gray squares represent the
‘designated’ housekeeping genes. The right panels in Figure 2
present the correlation of the logarithmic transforms of the data
from the left panels (because the logarithmic function is
restricted to positive numbers, only genes that are expressed
simultaneously in the two arrays are used for this comparison).
Comparison of the graphs of simulated [Fig. 2(a) and (b)] and real
[Fig. 2(c)] noise indicates the similarity in the overall
distributions, although the real data have a greater variance.

Fig. 1. Correlation plots of the intensities of the ‘designated’ housekeeping genes in two microarray experiments: (a) HOSE cell lines at
different stages of malignancy, with 118M on the x-axis and, on the y-axis, 118Im (black) and 118NuTu (gray). Regression lines are indicated in
black and gray, respectively; (b) 118NuTu cell line following treatment with Fenretinide, with 0 h on the x-axis and, on the y-axis, 24 (black
circles), 48 (gray triangles) and 72 h (squares) of treatment. Regression lines are indicated in black solid, black dashed and gray, respectively
(note that the black solid and black dashed regression lines are overlapping).

‘Signal’  dataset  SD1 

The graphs of the two pairs of arrays in this dataset, together
with the regression line through the origin, are presented in
Figure 3. The housekeeping genes are marked in green. In
the case of the first pair [Fig. 3(a)], it is clear that the
regression line is along the line of normalization and, therefore,
all the reference normalization methods mentioned above will
perform well. This is not the case with the second pair
[Fig. 3(b)], and we applied the PCA(line) procedure for
determining the subset of housekeeping genes.

After thresholding, 296 genes were found with non-zero
intensities simultaneously in both arrays (132 up-regulated,
88 down-regulated and 76 housekeeping). PCA was applied
to this set ($\lambda_1 = 96\%$). The representation of the data along
the first two PCs is shown in Figure 4(a) [note that the first
PC, $P_1$, is along the regression line of this rotated version
of Fig. 3(b)]. The procedure for automatic detection of the
housekeeping genes is schematically illustrated in Figure 4(b).
The angle encompassing all data points (between 1.069 and
2.438 radians) was divided into 50 segments. The histogram
of the angular densities $\theta_k$ ($k = 1, 2, \ldots, 50$) is presented in
Figure 4(c) [$M(\theta_k) = 5.92$ and $\sqrt{V(\theta_k)} = 5.18$]. For
$\mu = 1.96$, three contiguous segments, starting at $k = 22$, contained
points with a significantly higher density [Equation (2)]. A
total of 63 points (subset $S$) from these segments were extracted.
These genes (orange points), together with the original set
of housekeeping genes (in green), are presented in Figure 4(d).
The collinearity between the identified genes and the
housekeeping genes is apparent. Thirty-two of the genes in $S$ belong
to the original set of 76 housekeeping genes in the analyzed
data, indicating that the procedure successfully recovered a
substantial fraction of them (32/76, or >40%). Moreover, the
procedure detected an additional 31 genes whose expression
changes in accordance with housekeeping gene behavior.
PCA was applied to the data in $S$ ($\lambda_1 = 99\%$), and the
first PC loading factors were $q_1 = 0.635$ and $q_2 = 0.773$,
corresponding to a relative normalization factor of 1.217.
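
The angular histogram used in this step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the angle is taken as arctan(R1/R2) extended to the full circle, which is how we read step (4.2) of Appendix 1, and the function names and example scores are hypothetical.

```python
import math

def angles(scores):
    """Angle of each (R1, R2) score pair, in [0, 2*pi).

    atan2(R1, R2) follows the arctan(R1 / R2) convention, under which a gene
    lying close to the first PC (R1 > 0, R2 near 0) sits near pi / 2.
    Pairs with R2 == 0 are skipped.
    """
    return [math.atan2(r1, r2) % (2.0 * math.pi)
            for r1, r2 in scores if r2 != 0.0]

def angular_densities(phi, p=50):
    """Count the angles falling in each of p equal-width segments."""
    lo, hi = min(phi), max(phi)
    delta = (hi - lo) / p                        # assumes hi > lo
    counts = [0] * p
    for a in phi:
        k = min(int((a - lo) / delta), p - 1)    # largest angle goes in the last segment
        counts[k] += 1
    return counts

# Hypothetical scores: large R1 with smaller R2 of either sign.
scores = [(1000.0, 50.0), (800.0, -40.0), (500.0, 200.0), (600.0, -300.0)]
phi = angles(scores)
print([round(a, 3) for a in phi])
print(angular_densities(phi, p=4))   # [1, 1, 1, 1]
```

The segment index here is zero-based, whereas the text numbers segments from 1.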

Fig. 2. Correlation plots of gene intensities in replicate arrays, displayed on untransformed (left panels) and logarithmic scales (right panels)
with indicated LR line (gray): (a) simulated data, containing gene-independent noise; (b) simulated data, containing gene intensity-dependent
noise; (c) two replicate arrays of the 118M cell line. The genes shown in gray squares represent the designated housekeeping genes included in
the arrays by the manufacturer.

Fig. 3. Correlation plots of gene intensities of two simulated array datasets (SD1) with indicated housekeeping genes (green squares) and
indicated LR line (orange): (a) ‘symmetric’ case, where the majority of the genes are housekeeping and the number and magnitude of up-
and down-regulated genes are similar; (b) the housekeeping genes are of a relatively smaller number, and the up-regulated genes dominate the
distribution.

Simulated dataset SD2

PCA was applied to 205 genes with non-zero intensities in
all eight arrays (88 up-regulated, 52 down-regulated and 64
housekeeping) ($\lambda_1 = 96\%$). The points in the plane of $P_1$
and $P_2$ lay between 1.079 and 1.938 radians. As in the case of
SD1, the densities of points in 50 segments were calculated
[$M(\theta_k) = 4.08$ and $\sqrt{V(\theta_k)} = 5.21$]. For $\mu = 1.96$,
three contiguous segments, containing a total of 64 points
(subset $S$), had a significantly higher density. The majority of
the points in $S$ belonged to the original set of housekeeping
genes analyzed (44, or 69%), and the remaining 20 were
split between 12 up-regulated and eight down-regulated
genes. PCA was applied to the data in $S$ ($\lambda_1 = 99\%$), and the
normalization coefficients $q_j$ ($j = 1, \ldots, 8$) were calculated
as the loadings of the first PC.
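
Loadings of the first PC for several arrays can be obtained without a linear-algebra library. The sketch below uses power iteration on the uncentred matrix $C = D^{T}D/(n-1)$ of Appendix 1; power iteration is our substitution for a full eigen-decomposition, and the function names and data are hypothetical.

```python
import math

def uncentred_covariance(D):
    """C = D^T D / (n - 1) for an n x m matrix D given as a list of rows."""
    n, m = len(D), len(D[0])
    return [[sum(row[a] * row[b] for row in D) / (n - 1) for b in range(m)]
            for a in range(m)]

def first_pc(C, iterations=200):
    """Leading eigenvector of C by power iteration, scaled to unit length."""
    m = len(C)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: arrays 2 and 3 are exact multiples (1.2 and 0.5) of array 1.
genes = [300.0, 1200.0, 5000.0, 20000.0]
D = [[g, 1.2 * g, 0.5 * g] for g in genes]
q = first_pc(uncentred_covariance(D))
relative = [x / q[0] for x in q]
print([round(r, 3) for r in relative])   # [1.0, 1.2, 0.5]
```

Scaling the loadings by the first one expresses every array relative to array 1; with noisy data the recovered ratios would only approximate the applied factors.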

We compared the accuracy of the PCA(line)-estimated
normalization factors with the ones estimated by LR and by the mean
(MEAN). We scaled all normalization factors so that their
sum was equal to 1, and the correlation between the true
values (x-axis) and the estimated values (y-axis) is presented
in Figure 5(a). Although the overall correlation between
the true and estimated normalization factors is quite good
[$R^2 = 0.9964$, 0.9862 and 0.9726 for PCA(line), LR and
MEAN estimates, respectively], it is clear that PCA(line)
provides the best estimates. We also calculated the error for
each individual array, defined as the percentage difference
of the estimated from the true normalization factor; the
minimum, maximum and average error values are presented
in Figure 5(b). This analysis indicated that the error of the
PCA(line)-derived estimates is on average lower by a factor
of 2 and 3 than that of the estimates derived by LR and
MEAN, respectively.
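
The error measure can be written out as below. This is a sketch with hypothetical function names and numbers; it takes the absolute percentage difference after both sets of factors have been scaled to a unit sum.

```python
def scale_to_unit_sum(factors):
    """Rescale a list of factors so that they add up to 1."""
    total = sum(factors)
    return [f / total for f in factors]

def percent_errors(true_factors, estimated_factors):
    """Absolute percentage difference of each estimate from its true value."""
    t = scale_to_unit_sum(true_factors)
    e = scale_to_unit_sum(estimated_factors)
    return [100.0 * abs(b - a) / a for a, b in zip(t, e)]

# Hypothetical true and estimated factors for four arrays.
true_factors = [1.0, 2.0, 0.5, 1.5]
estimated = [1.0, 2.2, 0.5, 1.5]
errors = percent_errors(true_factors, estimated)
print(min(errors), max(errors), sum(errors) / len(errors))
```

Because of the unit-sum scaling, an error in one estimate shifts the scaled values of all the other arrays as well.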

We further investigated the effect of data thresholding on the
PCA(line) procedure. We re-analyzed SD2 by applying PCA
to all 500 genes in the dataset. Since some of the scores along
$P_2$ were negative, the data points spanned the entire plane
(between 0.03 and 6.27 radians). In this case, we set $p = 200$
and $\mu = 4$. Two consecutive segments [Fig. 5(c)], containing
a total of 77 genes, were determined to have significant angular
densities. The overwhelming majority of genes (55) in this
set belonged to the original set of housekeeping genes. The
housekeeping gene sets derived by PCA(line) on thresholded
and unfiltered data were strongly overlapping: all but four
were identical to the 64 housekeeping genes determined with
the thresholded data. Finally, the PCA-determined normalization
factors in this case were virtually identical to the ones
determined with the thresholded data.

Lymphoma  Data 

PCA was applied to all 7129 genes in the dataset ($\lambda_1 =
88.31\%$). All loadings of $P_1$ were scaled by the first one,
resulting in a normalization factor of 1 for the first array.
Figure 6(a) depicts the comparison between LR- and PCA-derived
(yellow circles) values. The high correlation ($R^2 = 0.99$)
between the two series is apparent. Further, we applied
the PCA(line) procedure. Three contiguous segments (from a
total of 200), containing 1095 genes, were above the threshold
[$M(\theta_k) = 35.64$, $\sqrt{V(\theta_k)} = 72.21$, $\mu = 4$]. PCA was
applied to the intensities of the genes in $S$ ($\lambda_1 = 93.85\%$),
and the loadings of $P_1$ were rescaled appropriately and compared
with the LR results [Fig. 6(a), black circles]. While the
PCA(line)-estimated values show good overall agreement with the
LR-derived results ($R^2 = 0.92$), they differ substantially in
some individual cases. The average absolute value of the relative
difference between LR- and PCA-derived factors was 7.52%, with
a range of 0.07-30.84%; the largest difference occurred for
array #65 [Fig. 6(a), marked with an arrow]. We then examined
the correlation of the intensities of the genes marked with ‘P’
(those of satisfactory quality) in arrays #1 and #65 [Fig. 6(b)].
The normalization lines [represented in orange and blue,
respectively, for LR and PCA(line)] indicate that in the case of
LR, a handful of strongly expressed genes are driving the
normalization. A similar graph was obtained with arrays #1 and
#58, which also showed a large difference between the two
normalization procedures.

Fig. 4. (a) The data from Figure 3(b), presented in the PC-plane; (b) schematic illustration of segmentation of the part of the PC-plane containing
the data; (c) histogram of the angular densities of the segments; (d) ‘true’ (green) and PCA(line)-detected housekeeping genes (orange).

To determine how the number of segments in the plane
impacts the estimated normalization coefficients, we ran the
procedure with $p = 100$, 300, 400 and 500. In all cases,
the procedure extracted essentially the same subset of
normalizing housekeeping genes. The number of genes for each
$p$ was 1410, 1192, 1092 and 1162, respectively. We estimated
a (5 × 5) correlation matrix of the normalization factors
derived for these four values of $p$ and for the original
$p = 200$. All coefficients in the correlation matrix were
greater than 0.994, indicating the high degree of reproducibility
between the normalization factors derived for different numbers
of segments ($p$). We also estimated the coefficient of variation
(COV) between the five series of estimates. The average COV for
the 72 normalization factors was 1.71%.
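
The average coefficient of variation across runs can be computed as in the sketch below. The function name and the numbers are hypothetical, and the use of the sample standard deviation is an assumption; the text does not state which estimator was used.

```python
from statistics import mean, stdev

def mean_cov_percent(series):
    """Average coefficient of variation (per cent) across arrays.

    series: one list of per-array normalization factors for each value of p.
    """
    per_array = zip(*series)   # regroup: all estimates of the same array together
    covs = [100.0 * stdev(values) / mean(values) for values in per_array]
    return mean(covs)

# Hypothetical factors for three arrays, estimated with three values of p.
series = [
    [1.00, 1.21, 0.80],
    [1.00, 1.19, 0.82],
    [1.00, 1.20, 0.81],
]
print(round(mean_cov_percent(series), 2))
```

A first array whose factor is fixed at 1 contributes a COV of exactly zero to the average.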

DISCUSSION 

Normalization of gene intensities in multi-array experiments
is crucial for the ultimate biological interpretation to be
meaningful (Hoffmann et al., 2002). Only after proper
normalization can changes in expression of a given gene amongst
the studied samples in the experiment be characterized
quantitatively. Conversely, erroneous (or no) normalization may
lead to inaccurate estimation of the changes in gene expression,
including wrong conclusions with regard to their up- or
down-regulation. While optimal normalization is still a subject
of discussion, individual investigators are faced daily
with many questions about the analysis of these complex
data. For example, should the array data be logarithmically
transformed prior to normalization; should low-intensity
spots be discarded, and, if so, what is the right cut-off
limit for this operation; should the mean or median intensity
of the arrays be used for normalization; or alternatively,
do ‘designated’ housekeeping genes reliably play their
assigned role?

Fig. 5. (a) Relation of ‘true’ normalization factors and factors estimated via PCA(line), LR and MEAN in a simulated dataset containing
eight arrays. The black line indicates the line of identity; (b) ranges (minimum and maximum) and average of the absolute values of relative
errors of estimation of the normalization factors in the three estimates; (c) histogram of the angular densities of the segments in the PCA(line)
for unfiltered data.

In this report, we address all these questions and present a
simple procedure, based on PCA, for normalization of datasets
generated with single-channel arrays. The procedure makes
minimal assumptions about the data and does not require any
pre-processing, pre-screening or filtering of the data.

The need for alternative normalization techniques arose
with the realization that genes assumed to be housekeeping and
‘designated’ as such by the manufacturers on arrays are not
reliable for accurate data normalization. In the first experiment
with HOSE cells, investigating a set of three cell lines with
close genetic origin, the ‘designated’ housekeeping genes
change in a coordinated fashion, and it is likely that they
fulfill their role as normalizing genes. This result is anticipated
since the three cell lines were cultured under standard growth
conditions and the observed differences in the global gene
expression profiles are related to only a small subset of genes
associated with the sequential transition of the cells through
the process of malignant transformation. Conversely, in the
second experiment, the ‘designated’ housekeeping genes
appear to change differentially in response to treatment with
Fenretinide. This is consistent with the dramatic biochemical
changes associated with the process of cells undergoing
programmed cell death (Querec, T.D. et al., manuscript in
preparation). The major alterations in the global gene expression
profile that precede and lead to the triggering of apoptosis
affect the expression states of most housekeeping genes.

Fig. 6. (a) Correlation between LR-estimated (x-axis) and PCA- or PCA(line)-estimated (yellow series and black series, respectively)
normalization factors for the LD. The orange line indicates the identity line. The arrows point at arrays with a large relative difference;
(b) correlation plots of intensities of genes marked with ‘P’ in arrays #1 and #65. The normalization lines derived by the LR and PCA(line)
estimates are indicated in orange and blue, respectively.

Pre-processing of the data prior to normalization is an
important issue. Typical steps include background correction,
logarithmic transformation and/or thresholding. We
believe that the background should be removed prior to
normalization, so that the normalization line goes through the
origin. Although we simulated gene intensities, as described
in the Materials and methods section, there is no theoretical
basis to assume that real data comply with this distribution.
Log-transformation has the advantage of transforming the
noise distributions approximately to Gaussian. This property
can be used for estimating the probabilities of differentially
expressed genes (Kerr et al., 2000). The PCA-based normalization
procedure, however, is based on identifying the genes
along the normalization line in the dataset and is invariant to
prior transformation. Moreover, based on the simulated ‘noise’
data, as well as on the HOSE cell replicates, it is apparent
that log-transformation may be detrimental to the analysis, as
it increases the relative contribution of the gene-independent
noise in genes expressed at low levels. Because of these
adverse effects, and the fact that by estimating the numbers
of genes in the segmented plane the PCA(line) procedure
allows low-expressed genes to be taken into consideration,
we chose to implement our normalization procedure on raw
(untransformed) data.

The described procedure is also insensitive to prefiltering
(thresholding) of the data, provided that the parameter
$\mu$ [Equation (2)] is adjusted appropriately. In the case
of ‘thresholded’ data, $\mu = 1.96$ will be sufficient to discriminate
between the sought housekeeping genes and the rest
[Fig. 4(c)]. In non-prefiltered data, this $\mu$-value will merely
distinguish the ‘noise’ genes from the signal ones. Thus, a larger
$\mu$ [as in the case shown in Fig. 5(c)] is required to detect
the normalizing genes sought. We therefore strongly recommend
exploring the characteristics of the angular histogram
of the data before setting the appropriate $\mu$-value.

The only assumption made about the distribution of the
intensities of the housekeeping genes for PCA(line) is that
they are distributed along a straight line. This assumption
is very sensible for single-channel arrays, unlike the case
of the double-channel arrays, where it is known that a
non-linear dependence exists between the gene expression levels
in the two channels (Yang et al., 2002). Furthermore, it
has been shown recently that even for these arrays the linear
and non-linear normalization methods perform similarly
(Park et al., 2003). In our experience, most of the non-linear
effects are due to improper scanning settings, which,
besides the unwanted variations, also produce saturated spots.
We consider the identification of the housekeeping genes
with intensities within the linear range, as proposed by the
PCA(line) routine, to be a reliable and robust source for
normalization.

The linearity is the basis of the stability of the approach with
respect to the parameter $p$: it is sufficient to detect a small
subset of $S$ to identify uniquely the normalization line.
Conversely, a larger set of genes along this line will not impede
the calculation of the normalization parameters. Still, in order
to obtain meaningful histograms of the number of genes in
each segment, we recommend that $p$ initially be selected to
contain on average at least 10 genes per segment. The condition
for linearity naturally excludes genes with saturated
expression levels and it thus contributes significantly to
reducing the interference of these typically large signals in the
normalization process.

Conditions (1) and (2) (see Introduction) are instrumental
for the successful performance of the referenced normalization
procedures. However, in single-channel arrays, such
as the Affymetrix platform and radiolabeled filter arrays,
it is a common phenomenon that the detected number of
up-regulated genes is larger than the number of the
down-regulated ones. This is due to the fact that the signals of
genes expressed at low levels and undergoing down-regulation are
close to or below the background level, and, therefore, their
change is either undetected or deemed statistically insignificant.
When conditions (1) and (2) do hold, as in the case of the
simulated data in Figure 3(a), PCA will be successful in
determining the normalization factors, with the following
advantages over the other referenced techniques:

•  It provides an objective measure, through the magnitude
of the first eigenvalue, of how ‘tightly’ the data are
distributed along the first PC.

•  It simultaneously determines normalizing coefficients for
the entire dataset. A common approach for normalization
of multiple experiments is to choose one array as the
baseline and to normalize the remaining arrays against it
(Golub et al., 1999). In order to avoid the lack of symmetry
of this procedure, the baseline is frequently computed as the
average gene expression profile (Tusher et al., 2001). This is
achieved naturally with PCA, as the first PC is an
approximation of the ‘average’ array in the dataset.

•  Viewing the entire set of multiple array data simultaneously
allows proper down-weighting of the ‘noise’
genes, which, during individual comparisons, may strongly
affect the calculation of the normalization coefficients.

The advantages of PCA are underscored in the LD example,
where a single PCA step applied to the entire dataset estimates
normalization coefficients that are almost identical to the ones
determined by the pair-wise LR procedures, which use only
well-measured genes in each pair [Fig. 6(a)].


The PCA(line) procedure, besides having the general advantages
of PCA listed above, can also deal successfully
with situations where conditions (1) and (2) do not apply. In
the simulated datasets, the PCA(line) results are closest to
the true values, as judged by the relative mean-square errors
of the three procedures tried. Visual inspection of the
LR and PCA(line) normalization lines in the graph shown
in Figure 6(b) suggests that this is also true for the Affymetrix
data. In addition, PCA(line) eliminates the need for a
baseline array, which, as shown by Bolstad et al. (2003), has
a clear disadvantage relative to complete-data normalization
methods such as the one proposed here.

In conclusion, the proposed normalization procedure
significantly improves the accuracy and precision of the measured
gene expression levels. Such procedures will become
even more relevant with further refinement and standardization
of the microarray technology.
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APPENDIX  1:  ALGORITHM  DESCRIPTION 

(1)  Construct  the  data  matrix  D(i\  j),  where 

i  =  1, . . .  ,n(n — total  number  of  genes  on  each  array), 
j  =  1, . . .  ,m(m — total  number  of  arrays  in  the 
dataset). 

(2)  (Optional)  thresholding  of  the  data: 

(2.1)  Set  the  values  in  D  smaller  than  a  given  value 
(e.g.  200  a.u.  for  the  Clontech  data)  to  0. 

(2.2) Remove from $D$ genes with 0 intensities in at
least one array, resulting in a new data matrix
$D'$ ($n' \times m$), where $n' < n$.

(3) PCA of $D$ (here and in the rest of the text $D$ should be
substituted by $D'$ in the case of thresholding, as well as
$n$ by $n'$).

(3.1) Calculate $C$, the covariance matrix of $D$:

\[ C = \frac{1}{n-1} D^{T} D, \]

where $D^{T}$ denotes the transpose matrix of $D$.

(3.2) Calculate eigenvectors $Q$ and eigenvalues $\Lambda$ of
the covariance matrix $C$, i.e.:

\[ C Q = Q \Lambda \]

The rows in $Q$ are the PCs $P_1, P_2, \ldots, P_m$.

(3.3) Calculate the scores $R = D P^{T}$.
(4) Let $R_1^i$ and $R_2^i$ be the scores of the $i$-th gene along $P_1$
and $P_2$.

(4.1) Disregard genes for which $R_2^i = 0$.

(4.2) Calculate the angle $\varphi_i$ (in radians) between $P_2$ and
the vector with coordinates $(R_1^i, R_2^i)$, as follows:

\[
\varphi_i =
\begin{cases}
\arctan(R_1^i / R_2^i), & \text{if } R_1^i > 0 \text{ and } R_2^i > 0,\\
\pi + \arctan(R_1^i / R_2^i), & \text{if } R_1^i > 0 \text{ and } R_2^i < 0,\\
2\pi + \arctan(R_1^i / R_2^i), & \text{if } R_1^i < 0 \text{ and } R_2^i > 0,
\end{cases}
\qquad i = 1, \ldots, n.
\]

(5) Segment the part of the plane defined by the first 2 PCs
in $p$ partitions.

(5.1) Determine the segment $\Theta = \max(\varphi_i) - \min(\varphi_i)$.

(5.2) Determine a step $\delta = \Theta / p$.

(5.3) Define the subset of genes $s_k$ in each of the $p$
segments, defined as

\[ s_k \in [(k-1)\delta + \min(\varphi_i),\; k\delta + \min(\varphi_i)], \qquad k = 1, \ldots, p. \]

(6) Determine the subset of housekeeping genes $S$.

(6.1) Determine the number of genes $\theta_k$ in each
subset $s_k$.

(6.2) Estimate the mean, $M(\theta_k)$, and variance, $V(\theta_k)$,
of the distribution of $\theta_k$.

(6.3) Evaluate if

\[ \theta_k > M(\theta_k) + \rho \sqrt{V(\theta_k)} \]

holds for any $k$. $\rho$ is a cut-off parameter, which
can be set to 1.96 if a normal distribution of $\theta_k$ is
assumed [see body of the paper, Equation (2)].

If none of the segments satisfies the condition, it
means that either none of the genes can serve as
a housekeeping gene ($S = \emptyset$) or all genes in the
dataset can be assumed to be housekeeping genes
($S = D$). Then the loadings of $P_1$ (3.2) may be
used as normalizing factors.

(6.4) The expression levels of the genes in each array
should be divided by these loadings.

End of the Procedure

(6.5) Let $Z$ denote the set of these segments that satisfy
the condition in 6.3. If for a certain $q$, $s_q \in Z$,
then

(6.5.1) If $s_{q+1} \notin Z$, then

(6.5.1.1) If there are no other $q$s, for
which $s_q \in Z$, then proceed as
in 6.4.

(6.5.1.2) Conversely, proceed as in 6.5.

(6.5.2) If $s_{q+1} \in Z$, then the genes in these two
segments are assumed to be housekeeping
genes; $S = s_q \cup s_{q+1}$. Add to $S$ the
genes of any consecutive segments that
belong to $Z$.

(6.5.2.1) Apply PCA (3.2) to the gene
expression levels in $S$. The
loadings of $P_1$ can be used
as normalizing factors. The
expression levels of the genes
in each array should be divided
by these loadings.

End of the Procedure
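
Steps (4) to (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The function names are hypothetical, the sample variance is used for step 6.2, and the quadrant with both scores negative (not listed in step 4.2) is assumed to follow the same rule as the other negative-denominator case. The toy angles are invented to give one dense cluster.

```python
import math
from statistics import mean, variance

def angle(r1, r2):
    """Angle in radians from the piecewise arctan rule of step 4.2 (r2 must be non-zero)."""
    base = math.atan(r1 / r2)
    if r2 < 0:
        # Listed in the appendix for r1 > 0; applying it to r1 <= 0 is an assumption.
        return math.pi + base
    return base if r1 >= 0 else 2 * math.pi + base

def segment_counts(angles, p):
    """Number of genes in each of p equal-width angular segments (steps 5.1 to 6.1)."""
    lowest = min(angles)
    delta = (max(angles) - lowest) / p  # assumes the angles are not all identical
    counts = [0] * p
    for a in angles:
        k = min(int((a - lowest) / delta), p - 1)  # the largest angle goes to the last segment
        counts[k] += 1
    return counts

def dense_segments(counts, cutoff=1.96):
    """1-based indices of segments with count > mean + cutoff * standard deviation (step 6.3)."""
    threshold = mean(counts) + cutoff * math.sqrt(variance(counts))
    return [k + 1 for k, c in enumerate(counts) if c > threshold]

def consecutive_run(z):
    """First run of two or more adjacent segments in z (step 6.5.2); empty if there is none."""
    run = []
    for k in sorted(z):
        if run and k == run[-1] + 1:
            run.append(k)
        elif len(run) >= 2:
            break
        else:
            run = [k]
    return run if len(run) >= 2 else []

# Hypothetical angles: about one gene per segment plus a dense cluster in segments 9 and 10.
angles = [0.0, 2.0] + [0.05 + 0.1 * k for k in range(20)] + [0.85] * 30 + [0.95] * 30
counts = segment_counts(angles, p=20)
z = dense_segments(counts)
housekeeping_segments = consecutive_run(z)
```

An empty result from `consecutive_run` corresponds to the fallback branch of the procedure, in which the loadings of the first PC of the whole dataset are used.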

APPENDIX  2:  SIMULATED  DATASET 

Let $g_{i1}$ be the gene intensity of the $i$-th gene in the first
array ($i = 1, 2, \ldots, 500$). The corresponding intensities in
the second array in SD1 were generated as follows.

\[
\begin{aligned}
g_{i2} &= q_{12} \cdot \min[\alpha_{up}\, g_{i1},\ \beta_{up}], & i &= 1, \ldots, 200,\\
g_{i2} &= q_{12} \cdot \max[\alpha_{down}\, g_{i1},\ \beta_{down}], & i &= 201, \ldots, 350,\\
g_{i2} &= q_{12} \cdot g_{i1}, & i &= 351, \ldots, 500,
\end{aligned}
\tag{A.1}
\]

where $q_{12} = 1.2$, and the $\alpha$s and $\beta$s are random numbers
within the following intervals:

$\alpha_{up} = (1, 10]$,

$\beta_{up} = (g/2, g_{max}]$, where $g_{max} = 80000$,

$\alpha_{down} = (0, 1/10]$,

$\beta_{down} = (g_{min}, g/2]$, where $g_{min} = 0$.
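
The generation of SD1 can be sketched as follows. This is a minimal illustration rather than the authors' code. It assumes that g/2 refers to half of the gene's own intensity in the first array, approximates the half-open intervals with `random.uniform`, and draws the first-array intensities from an arbitrary hypothetical range. The function name and seeds are illustrative.

```python
import random

def simulate_second_array(first, q12=1.2, g_max=80000.0, g_min=0.0, seed=1):
    """Second-array intensities following Equation (A.1); interval handling is approximate."""
    rng = random.Random(seed)
    second = []
    for i, g in enumerate(first, start=1):
        if i <= 200:                      # genes 1..200: up-regulated, capped by beta_up
            alpha_up = rng.uniform(1.0, 10.0)
            beta_up = rng.uniform(g / 2, g_max)
            second.append(q12 * min(alpha_up * g, beta_up))
        elif i <= 350:                    # genes 201..350: down-regulated, floored by beta_down
            alpha_down = rng.uniform(0.0, 0.1)
            beta_down = rng.uniform(g_min, g / 2)
            second.append(q12 * max(alpha_down * g, beta_down))
        else:                             # genes 351..500: unchanged apart from scaling
            second.append(q12 * g)
    return second

# Hypothetical first-array intensities (the range is an assumption, not from the paper).
source = random.Random(0)
first_array = [source.uniform(200.0, 40000.0) for _ in range(500)]
second_array = simulate_second_array(first_array)
```

Only genes 351 to 500 carry the pure scaling factor of 1.2, so a normalization method is judged by how well it recovers that factor despite the 350 regulated genes.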

APPENDIX  3:  SIMULATED  DATASET 

Let $g_{ij}$ be the gene intensity of the $i$-th gene in the $j$-th array
($i = 1, 2, \ldots, 500$; $j = 1, 2, \ldots, 7$). Equation (A.1) describes
the generation of the data in SD2 ($q_{12}$ substituted correspondingly
with $q_{1j}$, randomly generated scaling parameters between 0.3 and 3),
derived from the intensities of the genes in the first array, where
$\alpha_{up}^{j}$ and $\alpha_{down}^{j}$ are consistent with a simulated
gradual increase in fold of changes between 1.5 and 4.5 with an increment
of 0.5, both for up- and down-regulated genes. Formally,

\[
\alpha_{up}^{j} = (1,\ 1 + j \cdot step], \qquad
\alpha_{down}^{j} = (0,\ 1/(1 + j \cdot step)], \qquad j = 1, \ldots, 7,
\]

where $step = 0.5$.
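
As a small worked example of the schedule above, the following sketch lists the upper bound of the up-regulation interval and of the down-regulation interval for each array. It assumes that j runs from 1 to 7, which reproduces the stated fold-change range of 1.5 to 4.5. The function name is hypothetical.

```python
def alpha_bounds(j, step=0.5):
    """Upper bounds of the alpha_up and alpha_down intervals for array j."""
    max_fold = 1 + j * step
    return max_fold, 1.0 / max_fold

# Assumed index range j = 1..7, giving maximum fold changes 1.5, 2.0, ..., 4.5.
schedule = {j: alpha_bounds(j) for j in range(1, 8)}
max_folds = [schedule[j][0] for j in sorted(schedule)]
```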
ABSTRACT

Animal models of ovarian cancer are crucial for understanding the
pathogenesis of the disease and for testing new treatment strategies. A
model of ovarian carcinogenesis in the rat was modified and improved to
yield ovarian preneoplastic and neoplastic lesions that pathogenetically
resemble human ovarian cancer. A significantly lower dose (2 to 5 μg per
ovary) of 7,12-dimethylbenz(a)anthracene (DMBA) was applied to one
ovary to maximally preserve its structural integrity. DMBA-induced
mutagenesis was additionally combined with repetitive gonadotropin
hormone stimulation to induce multiple cycles of active proliferation of
the ovarian surface epithelium. Animals were treated in three arms of
different doses of DMBA alone or followed by hormone administration.
Comparison of the DMBA-treated ovaries with the contralateral control
organs revealed the presence of epithelial cell origin lesions at
morphologically distinct stages of preneoplasia and neoplasia. Their
histopathology and path of dissemination to other organs are very similar
to human ovarian cancer. Hormone cotreatment led to an increased lesion
severity, indicating that gonadotropins may promote ovarian cancer
progression. Point mutations in the Tp53 and Ki-Ras genes were detected
that are also characteristic of human ovarian carcinomas. Additionally, an
overexpression of estrogen and progesterone receptors was observed in
preneoplastic and early neoplastic lesions, suggesting a role of these
receptors in ovarian cancer development. These data indicate that this
DMBA animal model gives rise to ovarian lesions that closely resemble
human ovarian cancer and that it is adequate for additional studies on the
mechanisms of the disease and its clinical management.

INTRODUCTION 

Ovarian cancer is one of the leading causes of cancer-related deaths
among women (1, 2). The understanding of the molecular pathogenesis
of ovarian cancer has been hindered by the lack of sufficient
numbers of specimens at early-stage disease because of its frequent
diagnosis at advanced stages (3, 4). Consequently, the existence of
identifiable precursor lesions that ultimately develop into ovarian
cancer is still debatable (5, 6).

More than 80% of ovarian cancers originate in the ovarian surface
epithelium (7-12). Incessant ovulation, postmenopausal increase of
gonadotropin hormone levels, chronic inflammation, and environmental
carcinogens are assumed to play key roles in ovarian oncogenesis
(13-16).

Animal models that closely recapitulate human ovarian cancer are
crucial for understanding its pathogenesis and for testing new treatment
strategies. A number of models have been developed to date on
the basis of carcinogen treatment, gonadotropin/steroid hormone
stimulation, and genetic modeling (for review, see refs. 17, 18). The latter
is based on the introduction of genetic alterations through the germ
line or conditional inactivation of certain tumor suppressor genes,
such as Tp53 and pRb (19), or the ectopic expression of certain
oncogenes, or a combination of both (20). Transgenic models, however,
depend strongly on the specificity and timing of expression of
the promoter used in the ovary and, more specifically, in the ovarian
surface epithelium; until recently no such promoter was available.
Furthermore, most incorporated gene changes thus far are associated with
advanced human ovarian cancer, and their role in early-stage disease
is unknown. Recently, the MISRII promoter, which exhibits a relatively
restricted pattern of expression, was used to drive the expression
of the SV40 large T-antigen in the ovarian surface epithelium
(21). Approximately 50% of the female mice bearing the
MISRII-T-antigen transgene developed bilateral, poorly differentiated
ovarian tumors by 6 to 13 weeks of age. Similarly, most genetic models
developed to date are unable to reproduce the histopathological
diversity of human ovarian cancer and give rise to rapidly developing,
advanced-stage disease at a very young age. Hence, although very
important for understanding the role of discrete genes in ovarian
cancer, these models are inadequate for studying the preneoplastic and
early neoplastic stages of the disease or for prevention studies. In
contrast, the ovarian lesions induced by carcinogens and hormones in
general display all three stages of cancer development (initiation,
promotion, and progression). The direct implantation of chemical
carcinogens, such as 7,12-dimethylbenz(a)anthracene (DMBA), in the
rat ovary (22-24) leads to the induction of ovarian tumors at an
incidence of ~37%. These include adenocarcinomas, as well as
stroma and mesothelial tumors (22, 23, 25). There is, however, a lack of
information regarding the nature and sequence of events elicited by
DMBA and leading to ovarian cancer development.

To improve its usage and physiologic relevance to the human
disease, the DMBA model of ovarian cancer was modified (a) by
significantly decreasing the DMBA dose, thereby maximally preserving
the integrity of the organ, and (b) by incorporating multiple
gonadotropin hormone treatments, thus introducing an additional risk
factor associated with human ovarian cancer, known also to induce
hyperovulation and enhanced mitogenesis of the ovarian surface
epithelium (26). Characterization of this modified animal model
revealed the appearance of early and advanced lesions with a
progressive nature that range from nonneoplastic to preneoplastic to
malignant. Their histopathology and path of dissemination strongly
resemble human ovarian cancer.

MATERIALS  AND  METHODS 
Animals  and  In  vivo  Treatments 

Six-week-old virgin Sprague Dawley rats (Taconic Farms, Germantown,
NY) were used following NIH and Fox Chase Cancer Center animal care
guidelines. DMBA mixed with beeswax was directly applied to the right ovary
of 120 animals. The left ovaries were treated with beeswax only. Animals were
treated in three study arms (Supplemental Table 1): 60 animals (arm 1) with
2.5 μg of DMBA and 60 animals (arms 2 and 3) with 5 μg of DMBA. The
latter group was subdivided into two groups of 30 and subjected to six cycles
of treatment with pregnant mare's serum gonadotropin (Sigma, St. Louis, MO)
and human chorionic gonadotropin (Ferring Pharmaceuticals, Los Angeles, CA),
once every 2 weeks, starting at 2 months after DMBA application (arm 3), or
with corresponding vehicle at the same regimen (arm 2). Pregnant mare's serum
gonadotropin (in sterile saline: 0.9% NaCl; Abbott Laboratories, Chicago, IL)
and human chorionic gonadotropin (in bacteriostatic water) were administered
i.p. and i.m., respectively, each at a dose of 40 IU per animal.

DMBA  Suture  Preparation 

Three or 1.0 g of beeswax (Sigma) was melted in a sterile Petri dish on a
sandbath at 135°C in a chemical fume hood under amber light. One gram of
DMBA (Sigma) was added to the melted beeswax and mixed until melted.
Uncoated silk sutures (7-0 USP; United States Surgical, North Haven, CT)
were dipped into the melted mixture for 2 to 3 minutes. Sutures were air-dried
and wrapped in a sterilized aluminum sheet. Beeswax-control sutures were
prepared similarly. Sutures were stored at 4°C for up to 7 days before surgery.
The average DMBA weight per cm of suture was ~8 or ~15 μg for a 1:3 or 1:1
mixture of DMBA:beeswax, respectively, corresponding to a dose of ~2.5 and
~5 μg, respectively, for a ~3-mm implanted suture.
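
The relation between suture loading and implanted dose is simple proportional arithmetic, sketched below. The loadings and the implanted length are the approximate values quoted above; the function name is hypothetical.

```python
def implanted_dose_ug(load_ug_per_cm, suture_length_mm):
    """DMBA dose (μg) carried by a suture segment of the given length."""
    return load_ug_per_cm * suture_length_mm / 10.0  # 10 mm per cm

dose_1_to_3 = implanted_dose_ug(8.0, 3.0)    # 1:3 DMBA:beeswax mixture, ~8 μg/cm
dose_1_to_1 = implanted_dose_ug(15.0, 3.0)   # 1:1 DMBA:beeswax mixture, ~15 μg/cm
```

The results, about 2.4 and 4.5 μg, agree with the nominal doses of ~2.5 and ~5 μg within the stated approximations.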

DMBA  Application  to  the  Ovary 

Six-week-old virgin rats were anesthetized by inhalation of halothane,
followed by i.p. injection of 1 mL/kg body weight of xylazine (20 mg/mL),
acepromazine maleate (10 mg/mL), and ketamine-HCl (100 mg/mL) mixed in
a ratio of 1:2:3, respectively. The rat flanks were shaved and washed with
iodine solution and 70% ethanol. Sterile conditions were used throughout
the surgical procedure. A transverse, ~1.5-cm mid-lumbar incision was made
in the right flank of the animal, ~5 mm ventral to the lumbar muscles. The fat
pad with the attached ovary was gently pulled out of the cavity with blunt-end
forceps, held by the fallopian tube, and, under amber light, a DMBA/beeswax
suture was applied across the ovary, contralaterally to the fallopian tube/fimbria.
The suture ends were cut flush with the surface of the bursa. The organ was
placed back into the cavity and the muscle wall was sutured with sterile
absorbable sutures (4-0 USP; Fisher Scientific, Pittsburgh, PA). The skin was
closed with wound clips. Similarly, a beeswax-impregnated suture was
implanted into the left ovary. The animals were observed until awake and daily
for the next 10 to 14 days. The wound clips were removed 7 to 10 days after
surgery.

Tissue  Preparation  and  Immunohistochemistry 

Upon animal sacrifice, the ovaries and other organs (fallopian tubes, uterus,
and mammary glands) were harvested, formalin fixed (18 hours), and paraffin
embedded. Five-micron serial sections from different areas of each organ were
stained with H&E and subjected to histopathological examination. Adjacent,
unstained 5-μm sections were subjected to immunohistochemistry analysis for
the expression of several protein markers (Supplemental Table 3) with reagents
provided with corresponding antibody kits and following standard procedures
(27).


Mutation  Analysis 

Extraction of Genomic DNA from Ovarian Lesions. Six-micron sections
obtained from formalin-fixed, paraffin-embedded tissue blocks and containing
corresponding ovarian lesions were microdissected (PixCell II LCM system,
Arcturus Engineering, Inc., Mountain View, CA; 3-ms pulse, 75-mW power,
and 15- to 30-μm laser-spot size) to select ~2 to 3 × 10^4 cells. Genomic DNA
was extracted with the PicoPure DNA extraction kit (Arcturus Engineering,
Inc.). Cells were suspended in 50 μL of proteinase K buffer [100 mmol/L
Tris-HCl (pH 7.6), 0.5% SDS, 1 mmol/L CaCl2, and 100 μg/mL oyster
glycogen] and digested for 7 days at 55°C with daily addition of 50 μg of
proteinase K. Ten microliters of 25% Tris-buffered Chelex solution were
added and heated at 95°C for 10 minutes. Cell lysates were extracted twice
with phenol:chloroform:isoamyl alcohol (25:24:1) with the addition of
NH4C2H3O2 and once with chloroform. DNA was precipitated with 2 volumes
of 100% ice-cold ethanol, 1 μL of glycogen (20 μg/μL), and 2 μL of 4 N
NaCl at -20°C overnight. Pellets were collected by centrifugation at
13,000 × g for 15 minutes, washed with 70% ethanol, recentrifuged, dried,
and resuspended in 25 μL of 10 mmol/L Tris-HCl (pH 8.0). DNA concentration
was determined spectrophotometrically (ND-1000; NanoDrop Technologies,
Inc., Wilmington, DE).

PCR Amplification, Restriction Digest, and Direct Sequencing. Individual
gene exons were subjected to PCR amplification with corresponding
specific oligonucleotide primers (Supplemental Table 2), followed by
diagnostic restriction digest and, for Ki-Ras and Tp53, also by direct
sequencing at the Fox Chase Cancer Center sequencing facility. Digested and
undigested PCR products were resolved in a 4% Tris-acetate agarose gel
containing ethidium bromide (5 μg/mL; Sigma) for UV-light detection. In cases
where more than one band was visible, the band with the corresponding expected
size was purified from the gel with a Gel DNA extraction kit (Qiagen, Valencia,
CA). Genomic DNA obtained from the ovary of an untreated female rat was used
as control. Sequence analysis was carried out with Accelrys SeqWeb V.2 for the
Wisconsin GCG sequence analysis package V.10.

Histopathology  and  Statistical  Analysis 

Three 5-μm H&E-stained tissue sections obtained from different areas of
each ovary (one section each at 100 μm from the two ends and one from the
middle of the organ) were subjected to histopathology evaluation. Calls were
made for presence or absence of significant lesions. The latter were subdivided
into three groups: nonneoplastic, putative preneoplastic, and tumor (Table 1).

Generalized estimating equations in the context of logistic regression were
used to model the probability of developing a lesion of a specific severity as
a function of treatment and time on study. The outcome measure is a binary
indicator of whether a significant lesion was observed in a given ovary at time
of sacrifice. The correlation structure was modeled by assuming that two data
points were independent if and only if they were obtained from different
animals (i.e., the left and right ovary assessments are correlated if they came
from the same animal and are independent otherwise). All significance tests
were based on two-sided type 3 score statistics. The left and right ovaries of
each animal were assigned an ordinal score representing the maximum severity
of any lesion observed at time of sacrifice. The lesion score range was as
follows: 1 (no significant lesion), 2 (nonneoplastic), 3 (preneoplastic), and 4
(tumor).
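
The ordinal scoring rule can be sketched as follows. This is a minimal illustration with hypothetical names and toy records, not the analysis code used in the study. It also assumes that the binary outcome corresponds to any score above 1.

```python
# Score values as defined in the text: 1 (none), 2 (nonneoplastic), 3 (preneoplastic), 4 (tumor).
SEVERITY_SCORE = {"nonneoplastic": 2, "preneoplastic": 3, "tumor": 4}

def ovary_score(lesions):
    """Maximum severity among the lesions found in one ovary; 1 if none were found."""
    return max((SEVERITY_SCORE[name] for name in lesions), default=1)

def has_significant_lesion(lesions):
    """Binary outcome indicator (assumed here to mean a score above 1)."""
    return ovary_score(lesions) > 1

# Toy records for one hypothetical animal: the two ovaries share an animal identifier,
# which is what a clustered (correlated) analysis would group on.
animal = {"id": "rat-001", "right": ["nonneoplastic", "preneoplastic"], "left": []}
scores = {side: ovary_score(animal[side]) for side in ("right", "left")}
```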


Table 1  Incidence and severity of DMBA-induced ovarian lesions

| Severity of lesions | Arm 1, DMBA (2.5 μg) | Arm 2, DMBA (5.0 μg) | Arm 3, DMBA (5.0 μg) + hormone | Control ovaries, Arm 1 | Control ovaries, Arm 2 | Control ovaries, Arm 3 | Total ovaries |
|---|---|---|---|---|---|---|---|
| No lesions cnt. (%) | 35 (59.32) | 12 (40.00) | 14 (48.28) | 52 (88.13) | 23 (76.67) | 21 (72.41) | 157 (66.52) |
| Nonneoplastic lesions cnt. (%) * | 11 (18.64) | 5 (16.66) | 1 (3.45) | 5 (8.47) | 4 (13.33) | 2 (6.89) | 28 (11.86) |
| Putative preneoplastic lesions cnt. (%) † | 12 (20.34) | 13 (43.33) | 11 (37.93) | 2 (3.38) | 2 (6.67) | 6 (20.69) | 46 (19.49) |
| Neoplastic lesions cnt. (%) | 1 (1.69) | 0 (0.00) | 3 (10.34) | 0 | 1 (3.33) | 0 | 5 (2.12) |
| Total animals/Total ovaries cnt. (%) | 59 (25.00) | 30 (12.71) | 29 (12.29) | 59 (25.00) | 30 (12.71) | 29 (12.29) | 236 (100) |

* Chronic inflammation; foreign body granuloma; prominent corpora lutea; suture granuloma; salpingitis.

† Epithelial hyperplastic lesions: ovarian surface epithelium or bursal flat hyperplasia (either pseudostratification or real stratified hyperplasia); ovarian surface epithelium or bursal
papillae or papillomatosis; inclusion cysts; endosalpingiosis. All these lesions can present with or without atypia.

Abbreviation: cnt., number of lesions, ovaries, or animals.
RESULTS

Ovarian  Preneoplasia  and  Neoplasia  Induced  in  Rats 
with  DMBA 

Female Sprague Dawley rats were subjected to local application of
DMBA/beeswax to their right ovaries in three treatment arms. Their
left ovaries were treated as internal controls by application of beeswax
alone. To determine the sequence of histologic and molecular changes
elicited by DMBA in the ovary, subgroups of animals were sacrificed
at various time points, up to 12 months (Supplemental Table 1).
Overall, an apparent decrease in volume was evident in the DMBA-treated
ovaries in arms 1 and 2. Relative to the control ovaries, the
histologic and physiologic integrity of the treated organs was well
maintained, with the exception of a small reduction in the rate of
follicular development and corpora lutea formation (Fig. 1A). In arm
3, as a result of the stimulatory effect of the administered gonadotropin
hormones, the reduction in volume of the DMBA-treated ovaries
was less apparent. On average, a 4- to 5-fold larger number of
developing follicles and corpora lutea was observed in both ovaries,
as compared with the ovaries of animals in arms 1 and 2 (data not
shown). No other histologic changes were observed in the ovaries during
the first 4 to 5 months after DMBA treatment. At 5 to 6 months
posttreatment and persisting to the end of the experiment, a number of
different types of lesions were observed (Table 1): (a) nonneoplastic
lesions (chronic inflammation, foreign body granuloma, prominent
corpora lutea, suture granuloma, and salpingitis) were found in both
DMBA-treated and control ovaries and at a similar frequency; and (b)
lesions of a putative preneoplastic nature and with a progressive
character were observed predominantly in the DMBA-treated ovaries
(Fig. 1, B and C). These represent proliferative epithelial lesions,
present either along the surface of the organ or in the ovarian cortex.
Other preneoplastic lesions represent inclusion cysts or simple serous
microcysts; others are cortical lesions surrounded by ovarian stroma
and characterized by the presence of several gland-like structures,
usually covered by a simple serous cuboidal epithelium, and some
resembling fallopian tube epithelial differentiation (endosalpingiosis).
A few preneoplastic lesions exhibit cellular atypia and are
classified as epithelial hyperplastic lesions with dysplasia. None of the
hyperplastic epithelial lesions are invasive; they are well circumscribed,
small, and with low mitotic rate. These characteristic features
separate them easily from either borderline ovarian tumors (also
known as serous tumors of low malignant potential) or invasive
adenocarcinomas and bona fide ovarian tumors, detected in arms 1
and 3 only. A tumor highly reminiscent of human serous low malignant
potential tumor was detected at 12 months after DMBA treatment
in arm 1 (Fig. 2A), an invasive serous adenocarcinoma at 6 months
in arm 3 (Fig. 2B), a squamous-cell carcinoma at 9 months in arm 3
(Fig. 2C), and an undifferentiated carcinoma at 11 months in arm 3
(Fig. 2D).

Fig. 1. Putative ovarian preneoplastic epithelial lesions induced by DMBA. A, left
panel: beeswax- (L.Ov) and DMBA-treated (R.Ov) whole ovaries; middle and right
panels: H&E-stained sections of control (L.Ov) and DMBA-treated (R.Ov) ovaries. B, left
panel: ovarian surface epithelial and bursal epithelial hyperplasia (arrows); right panel:
higher magnification of portions containing papillary bursal epithelial (top panel) and flat
columnar or pseudostratified ovarian surface epithelial hyperplasia (bottom panel). C, left
panel: inclusion cyst with papillae. Note two cross-sections of papillae (arrows) inside the
epithelial gland-like inclusion cyst. Right panel: advanced epithelial papillary hyperplasia.
Note several cross sections of papillary structures on the ovarian surface (arrows). (H&E
staining; bar scale: 100 μm; S, suture).

Fig. 2. Neoplastic lesions induced by DMBA in the ovary. A, noninvasive exophytic
growth of papillary structures forming a serous low malignant potential tumor on the
ovarian surface. Note that the panel to the right shows little or no nuclear atypia of the
tumor cells. B, invasive serous adenocarcinoma. The low magnification panel (left) shows
invasive gland-like neoplastic structures invading the ovarian cortex. The contiguous
panel shows at higher magnification the atypical tumor cells. C, squamous-cell carcinoma
invading the ovary. The contiguous panel shows at higher magnification the atypical
squamous carcinoma cells. D, undifferentiated carcinoma. The contiguous panel shows at
higher magnification the atypical poorly to undifferentiated tumor cells. (H&E staining;
bar scale: 100 μm, low and high magnification at the left and right, respectively).

Statistics 

The cumulative incidence of preneoplastic lesions and bona fide
tumors in the DMBA-treated ovaries in arm 1 was 22%, whereas in
arms 2 and 3 it was about 2-fold higher (43.33 and 48.27%, respectively;
Table 1). However, both the preneoplastic lesions and the bona fide
tumors  in  arm  3  displayed  a  more  complex,  advanced  histology 
relative to those in arms 1 and 2. When all three types of lesions were
considered together in each of the three arms, time to sacrifice was not
a significant predictor of lesion severity (P = 0.356). Thus, the
probability that an animal bore a lesion of a specific degree of severity
was not observed to depend on how long the animal was allowed to
survive before sacrifice. The level of DMBA treatment, however, had
a significant effect on lesion severity (P < 0.0001). Specifically, the
control ovaries had a significantly lower incidence of lesions and at a
lower severity than the DMBA ovaries in arms 1, 2, and 3, respectively
(P < 0.05). Furthermore, the cumulative incidence of preneoplastic
lesions and tumors together was significantly higher in arms 2 and 3
as compared with arm 1 (P < 0.05); however, there was no significant
difference in the incidence of these lesions between arms 2 and 3
(P = 0.73).
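
The cumulative incidences can be recomputed from the counts for the DMBA-treated ovaries in Table 1, as in the following sketch. The dictionary layout and names are illustrative; the counts are those listed in the table.

```python
# Counts for DMBA-treated ovaries, taken from Table 1.
TABLE1_DMBA_OVARIES = {
    1: {"total": 59, "preneoplastic": 12, "neoplastic": 1},
    2: {"total": 30, "preneoplastic": 13, "neoplastic": 0},
    3: {"total": 29, "preneoplastic": 11, "neoplastic": 3},
}

def cumulative_incidence(arm):
    """Percentage of DMBA-treated ovaries with a preneoplastic or neoplastic lesion."""
    row = TABLE1_DMBA_OVARIES[arm]
    return 100.0 * (row["preneoplastic"] + row["neoplastic"]) / row["total"]

incidence = {arm: cumulative_incidence(arm) for arm in TABLE1_DMBA_OVARIES}
fold_vs_arm1 = {arm: incidence[arm] / incidence[1] for arm in (2, 3)}
```

With these counts the incidences are 13/59, 13/30, and 14/29, that is, roughly 22%, 43%, and 48%, so arms 2 and 3 are each close to 2-fold higher than arm 1.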

Immunohistochemical  Characterization  of  Ovarian  Lesions 

Epithelial Cell Origin. The epithelial cell origin of the preneoplastic
lesions and carcinomas was confirmed by their positive anti-cytokeratin
immunostaining, characteristic of most types of epithelial
cells (Fig. 3), and the negative anti-vimentin immunostaining that
detects a variety of mesenchymal cells (data not shown).

Expression of Estrogen (ER) and Progesterone (PgR) Receptors.
To determine whether ER and PgR play a role during ovarian
cancer development in this model, their expression status was examined
by immunohistochemistry for ER-α and PgR (A/B). Although
the expression of both receptors is low to undetectable in morphologically
normal ovarian surface epithelium cells, all tested preneoplastic
lesions and the serous low malignant potential tumor are strongly
positive for both ER-α and PgR (Fig. 4, A and B, left and middle
panels, respectively). The expression of both receptors, however, is
either markedly decreased or undetected in the invasive carcinomas
(Fig. 4, C and D, left and middle panels, respectively).

Expression of Tp53. Anti-Tp53 immunohistochemistry was carried
out to determine whether Tp53 gene mutations leading to loss of
function and accumulation of the protein are also induced during
ovarian cancer development by DMBA. A strong positive anti-Tp53
immunostaining was detected in the two invasive and the squamous
cell carcinomas (Fig. 4, C and D, right panel, and data not shown) but
not in the preneoplastic lesions (Fig. 4A, right panel) or the serous low
malignant potential tumor (Fig. 4B, right panel).

Mutation  Analysis 

Tp53 Gene. To examine the mutational status of Tp53 during ovarian
cancer development in this model, genomic DNA was extracted from
microdissected normal-appearing ovarian surface epithelium, preneoplastic
lesions, tumors, and a control untreated ovary. Tp53 exons 4 to 8 were
PCR-amplified from purified genomic DNA samples with corresponding
oligonucleotide primers (Supplemental Table 2). PCR products were
subjected to bi-directional sequencing after extraction from agarose gels.
Individual Tp53 mutations were detected in four of the examined
preneoplastic lesions and in all tumors (Table 2).


Fig. 3. Cytokeratin-positive immunostain in preneoplastic and neoplastic lesions
induced by DMBA demonstrates their epithelial origin. Positive cytokeratin immunostaining
of ovarian surface epithelium flat stratified (A) and papillary hyperplasia (B), serous low
malignant potential tumor (C), invasive serous adenocarcinoma (D), and undifferentiated
carcinoma (E). (Hematoxylin counterstaining; bar scale: 100 μm).

Ki-Ras Gene. To determine whether activating mutations of Ki-Ras
in codons 12, 13, and 61 are associated with ovarian cancer in this
model, genomic DNA, purified as for Tp53 analysis, was used for
PCR amplification with corresponding oligonucleotide primers
(Supplemental Table 2). PCR products were subjected to diagnostic
restriction digest with BssSI (for codon 61) and bi-directional sequencing
after purification from agarose gels. Only mutation of codon 61
(CAA→CAC; protein Gln→His) was identified in this rat model and
was present in 4 of the 12 examined preneoplastic lesions (Table 2)
and in the invasive adenocarcinoma.
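
The amino acid consequence of the codon 61 change can be checked against the standard genetic code, as in this small sketch. Only the four codons for glutamine and histidine are included, and the function names are hypothetical.

```python
# Subset of the standard genetic code (DNA codons) covering glutamine and histidine.
CODONS = {"CAA": "Gln", "CAG": "Gln", "CAC": "His", "CAT": "His"}

def amino_acid_change(reference_codon, mutant_codon):
    """Amino acid substitution implied by a codon change, e.g. 'Gln->His'."""
    return f"{CODONS[reference_codon]}->{CODONS[mutant_codon]}"

def is_missense(reference_codon, mutant_codon):
    """True if the codon change alters the encoded amino acid."""
    return CODONS[reference_codon] != CODONS[mutant_codon]

codon61_change = amino_acid_change("CAA", "CAC")
```

A third-position change to CAG would instead be silent, which is why the specific A-to-C transversion matters here.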

PgR. The presence or absence of an activating mutation of PgRs at
codon 660 was also examined in extracted genomic DNA, with PCR
amplification with corresponding oligonucleotide primers and diagnostic
restriction digest with TspRI (Supplemental Table 2). Such a
mutation was not detected in any of the examined lesions.

Fig. 4. ER-α, PgR, and Tp53 expression in putative preneoplastic and neoplastic
ovarian lesions induced by DMBA. Left panel: anti-ER-α; middle panel: anti-PgR;
and right panel: anti-Tp53 immunostaining of (A) DMBA-treated ovaries containing
epithelial flat and papillary hyperplasia, (B) serous low malignant potential tumor,
(C) invasive serous adenocarcinoma, and (D) undifferentiated carcinoma. Note that
the ER-α and PgR immunostains are markedly decreased in C and D and that
Tp53 immunostain is markedly decreased or absent in A and B. (Hematoxylin
counterstaining; bar scale: 100 μm).

DISCUSSION 

This study sought to further improve the DMBA rat model
of ovarian oncogenesis and to characterize the distinct stages of
preneoplasia and neoplasia. The contribution of gonadotropin hormones to
this process was also demonstrated. DMBA treatment of the ovary
induces putative preneoplastic lesions of epithelial cell origin and with
progressive histology that are assumed to represent precursors of
ovarian cancer clonal development. Given the difficulties in obtaining
a consensus on what human ovarian preneoplastic or precursor lesions
are, an attempt was made to classify the putative precursor lesions of
the rat ovary with terminology used for human ovarian epithelial
lesions. The lesions observed in the rat ovary represent proliferative
epithelial lesions of variable degrees of differentiation, without or
with dysplasia, and localized along the ovarian surface and cortex.
Some of the lesions, especially those seen on the surface, are similar
to isolated papillae or diffuse papillomatosis seen in human ovaries. In
addition, there are occasionally other ovarian surface epithelium-derived
structures that were previously described in humans, i.e.,
inclusion cysts or simple serous microcysts. None of the observed
hyperplastic epithelial lesions is invasive, and all are quite distinct from
either serous low malignant potential ovarian tumors or invasive
carcinomas. The development of the putative precursor lesions
generally precedes the emergence of bona fide tumors, which also display
variable degrees of differentiation and progression, ranging from early
tumors to high-grade malignant, invasive carcinomas. In addition to
the tumors detected in this study, a bilateral invasive carcinoma with
clear-cell histology was detected within 12 months in an animal
whose ovaries were treated bilaterally with ~5 μg of DMBA (not part
of the three study arms). This advanced tumor displayed widespread
dissemination to i.p. organs, production of ascites, and metastatic
hemorrhagic foci in the lungs (data not shown).

Table 2 Mutations detected in the Ki-Ras and Tp53 genes in DMBA-induced preneoplastic and neoplastic ovarian lesions in the rat

| Type of lesion (cnt.) | Ki-Ras codon 61 CAA→CAC (cnt.) | Tp53 rat codon (exon) | Tp53 human codon | Tp53 mutation: DNA | Tp53 mutation: protein | Prevalence in human ovarian cancer | Protein accumulation |
| OSE/Bursal epithelial papillae (3) | Yes (2) | 224 (6) | 226 | GTG→GCG | Val→Ala | ND | ND |
| OSE/Bursal epithelial papillae with dysplasia (2) | Yes (2) | ND | N/A | N/A | N/A | N/A | ND |
| Papillomatosis (3) | ND | 207 (6) | 209 | AGG→CGG | Silent (Arg) | ND | ND |
| Inclusion cysts with papillae (4) | ND | 209 (6) | 211 | ACT→ATT | Thr→Ile | Yes: 0.39% | ND |
| | | 178 (5) | 180 | GAA→GGA | Glu→Gly | ND | ND |
| Low malignant potential (LMP) tumor | ND | 255 (7) | 257 | Deletion ATC | Ile | Yes: 0.39% | ND |
| Squamous cell carcinoma | ND | 151 (5) | 153 | CCT→TCT | Pro→Ser | Yes: 0.1% | Yes |
| Cystadenoma and invasive adenocarcinoma | Yes | 218 (6) | 220 | CAG→CGG | Gln→Arg | Yes: 2.4% | Yes |
| Undifferentiated carcinoma (invasive) | ND | 173 (5) | 175 | CGC→CTT | Arg→Leu | Yes: 6.8%; other GYN cancer: 17.6% | Yes |

Abbreviations: ND, not detected; N/A, not applicable; GYN, gynecological; cnt., number of lesions from independent ovaries tested for mutation.

Statistically, the appearance of lesions of any given severity did not
depend significantly on the time of sacrifice after DMBA treatment;
however, escalation of carcinogen dose combined with hormonal
stimulation significantly increased the severity of the detected lesions.
The cumulative incidence of preneoplastic lesions and tumors was
likewise significantly increased at the higher DMBA dose in
arms 2 and 3. Although the lesion incidence in arms 2 and 3 was
similar, the lesions detected in arm 3 were more advanced than those
in arm 2, including bona fide tumors that were not observed at all
in arm 2. These data demonstrate the strong contribution of
gonadotropin hormones to the neoplastic progression of the ovarian lesions,
perhaps due to increased ovarian surface epithelium cell proliferation
and their effects on the underlying stroma. As demonstrated earlier,
treatment of rats with pregnant mare's serum gonadotropin and/or
human chorionic gonadotropin, in the presence or absence of surgical
scarring to the ovary, leads to a 5- to 10-fold increase in the rate of
ovarian surface epithelium cell proliferation (26).

The observed DMBA-induced reduction in ovarian volume,
accompanied by decreased follicular growth and corpora lutea formation, is
in good agreement with previously published data (28). The apparent
differences in the observed low-dose response and persistence of
ovarian hypoplasia in this study may be due to the slow-release form
of DMBA applied directly to the ovary. Although not yet well
understood in its full complexity, a suggested mechanism underlying
the observed ovarian hypoplasia and cellular destruction is that
DNA-adduct formation by DMBA metabolites leads to Tp53-mediated
inhibition of DNA synthesis, cell growth arrest, and caspase-dependent
or -independent apoptosis (29-31). Hence, DMBA-induced
mutation(s) that disrupt Tp53 function may allow affected
ovarian surface epithelium cells to evade growth arrest and apoptosis
and contribute to their malignant transformation.

Nonneoplastic and a small number of preneoplastic lesions, as well
as a small granulosa cell tumor, were also detected in control ovaries.
To determine whether such lesions occur spontaneously in this rat
strain, 20 nontreated animals were divided into two groups of 10 and
maintained to the age of 8 and 14 months, respectively. Examination
of their ovaries revealed no significant lesions, which strongly
suggests that the lesions observed in the control ovaries may be a
consequence of surgical scarring and chronic inflammation, and/or
carcinogen carryover from the contralateral ovary. These data indicate
that chronic inflammation, a known risk factor of ovarian cancer, may
contribute to the DMBA-induced neoplastic process, either directly on
epithelial cells through the action of secreted inflammatory cytokines
and growth factors or indirectly through their effect on the adjacent
stroma.

This study has additionally demonstrated that specific mutations in
the Tp53 and Ki-Ras genes, which are among the most frequent
mutations found in human ovarian tumors, are also associated with
ovarian cancer induced by DMBA. TP53 mutations are found in 35 to
40% of human ovarian tumors (32-34). The identified rat Tp53
mutations of codons 173 and 218 correspond to human codons 175
and 220, respectively, which are among the most frequently mutated in human
ovarian cancer (6.8% and 2.4%, respectively).³ Interestingly, both
mutations lead to a characteristic accumulation of Tp53 protein.
Activating mutations of Ki-Ras, including codon 61 detected in multiple
DMBA-induced preneoplastic lesions and in one carcinoma, have
been associated with ~20% of human ovarian tumors; of these, ~60%
are found in mucinous and ~20% in serous carcinomas (35, 36). The
relatively high frequency of Ki-Ras mutations in the preneoplastic
lesions and, especially, in the ones with dysplasia provides a strong
indication of their clonal (i.e., neoplastic) nature. It additionally
argues that Ki-Ras activation, either through mutation or by aberrant
upstream signals, is very important during ovarian cancer
development. Finally, a significant overexpression of the ER-α and PgR
proteins was also demonstrated in the preneoplastic lesions and the
serous low malignant potential tumor. However, the expression of the
two receptors was markedly decreased or absent in the advanced
carcinomas. The importance of this finding, in view of the existing
controversy over the expression status of ER-α and PgR in human
ovarian cancer (37, 38), mandates additional investigation.
Furthermore, the Val660Leu polymorphism that frequently occurs in exon 4 of
PgR has been suggested to have an association with human ovarian
cancer characteristics and with overall ovarian cancer risk (39).
Population-based studies, however, have demonstrated that no such
association exists (40, 41). Lack of this PgR mutation in the examined
ovarian lesions is additional evidence of the consistency of the DMBA
rat ovarian cancer model with the human disease.
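
The rat-to-human Tp53 codon correspondence used above can be read directly from Table 2, where every listed human codon is the rat codon plus 2. The sketch below only illustrates that observation. The names `TABLE2_CODON_PAIRS`, `codon_offsets`, and `rat_to_human_codon` are hypothetical, and the fixed offset is an assumption that holds for the Table 2 rows only; a general mapping would require an alignment of the rat and human sequences.

```python
# Rat -> human Tp53 codon numbers, transcribed from Table 2.
TABLE2_CODON_PAIRS = {
    224: 226,
    207: 209,
    209: 211,
    178: 180,
    255: 257,
    151: 153,
    218: 220,
    173: 175,
}


def codon_offsets(pairs):
    """Return the set of (human - rat) codon number differences."""
    return {human - rat for rat, human in pairs.items()}


def rat_to_human_codon(rat_codon, offset=2):
    """Map a rat Tp53 codon number to the human number.

    The default offset of 2 is only what the Table 2 rows show; it is an
    assumption outside the codon range covered by that table.
    """
    return rat_codon + offset


if __name__ == "__main__":
    print("offsets seen in Table 2:", codon_offsets(TABLE2_CODON_PAIRS))
    for rat in (173, 218):
        print("rat codon", rat, "-> human codon", rat_to_human_codon(rat))
```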

DMBA is a pluripotent carcinogen, which, through the formation of
DNA adducts, induces initiating point mutations that alter the
expression and/or activity of a number of oncogenes and tumor suppressor
genes (42-45). Although DMBA itself is not a known environmental
carcinogen associated with ovarian cancer, it shares similar mutagenic
mechanisms with other polycyclic aromatic hydrocarbons whose
abundance is relatively high in air pollutants and in tobacco smoke
and which have been implicated in human cancer development (46,
47). Hence, the observed effect of DMBA in the ovary may be
representative of the effect that such carcinogens have in the ovaries
of affected women.

Here, we have demonstrated that direct application of a low dose of
DMBA in the rat ovary, alone or combined with multiple cycles of
gonadotropin administration, elicits a neoplastic process that affects
mostly the ovarian surface epithelium and leads to the progressive
development of putative epithelial cell preneoplasia, serous low
malignant potential tumors, and invasive carcinomas. The similarity in
histology and path of dissemination of the DMBA-induced rat ovarian
carcinomas with those in the human and the presence of gene
mutations that are common in human ovarian cancer demonstrate the
validity of this animal model for additional delineation of the
mechanisms underlying ovarian tumorigenesis. Finally, DMBA-induced
ovarian oncogenesis in the rat could be used to preclinically test new
agents for the prevention and/or therapy of the disease.